A quick intro to the intro to R Lesson Series


This ‘Intro to R Lesson Series’ is brought to you by the Centre for the Analysis of Genome Evolution & Function’s (CAGEF) bioinformatics training initiative. This course was developed based on feedback on the needs and interests of the Department of Cell & Systems Biology and the Department of Ecology and Evolutionary Biology.

This lesson is the fourth in a 6-part series. The idea is that at the end of the series, you will be able to import and manipulate your data, make exploratory plots, perform some basic statistical tests, test a regression model, and make some even prettier plots and documents to share your results.


How do we get there? Today we are going to be learning data cleaning and string manipulation; this is really the battleground of coding - getting your data into the format where you can analyse it. We will also be learning R markdown so that we can easily annotate our code and share it with others in reproducible documents. In the next lesson we will learn how to do t-tests and perform regression and modeling in R. And lastly, we will learn to write some functions, which really can save you time and help scale up your analyses.


The structure of the class is a code-along style. It is hands on. The lecture AND code we are going through are available on GitHub for download at https://github.com/eacton/CAGEF, so you can spend the time coding and not taking notes. As we go along, there will be some challenge questions and multiple choice questions on Socrative. At the end of the class if you could please fill out a post-lesson survey (https://www.surveymonkey.com/r/PVHDKDB), it will help me further develop this course and would be greatly appreciated.


Packages Used in This Lesson

The following packages are used in this lesson:

tidyverse (ggplot2, tidyr, dplyr)
(twitteR)*
(httr)*
tidytext
viridis
knitr
kableExtra
wordcloud

*Used to generate the tweet tables used in this lesson. It is not necessary for you to install this - you can work from the tables. If you want to create these files - the code is here - twitter scrape.

Please install and load these packages for the lesson. In this document I will load each package separately, but I will not be reminding you to install the package. Remember: these packages may be from CRAN OR Bioconductor.


Highlighting

grey background - a package, function, code or command
italics - an important term or concept
bold - heading or ‘grammar of graphics’ term
blue text - named or unnamed hyperlink


Objective: At the end of this session you will be able to use regular expressions to ‘clean’ your data. You will also learn R markdown and be able to render your R code into slides, a pdf, html, a word document, or a notebook.


Load libraries

Since we are moving along in the world, we are now going to start loading our libraries at the start of our script. This is a ‘best practice’ and makes it much easier for someone to reproduce your work efficiently by knowing exactly what packages they need to run your code. We will learn how to do this with a function in Lesson 6!

library("tidyverse")
library("tidytext")
library("viridis")
library("knitr")
library("kableExtra")
library("wordcloud")

Data Cleaning or Data Munging or Data Wrangling

Why do we need to do this?

‘Raw’ data is seldom (never) in a usable format. Data in tutorials or demos has already been meticulously filtered, transformed and readied to showcase that specific analysis. How many people have done a tutorial only to find they can’t get their own data into the format needed to use the tool they have just spent an hour learning about?

Data cleaning requires us to:

Some definitions might take this a bit farther and include normalizing data and removing outliers, but I consider data cleaning as getting data into a format where we can start actively doing ‘the maths or the graphs’ - whether it be statistical calculations, normalization or exploratory plots.

Today we are going to mostly be focusing on the data cleaning of text. This step is crucial to taking control of your dataset and your metadata. I have included the functions I find most useful for these tasks but I encourage you to take a look at the Strings Chapter in R for Data Science for an exhaustive list of functions. We have learned how to transform data into a tidy format in Lesson 2, but the prelude to transforming data is doing the grunt work of data cleaning. So let’s get to it!



Intro to regular expressions

Regular expressions

“A God-awful and powerful language for expressing patterns to match in text or for search-and-replace. Frequently described as ‘write only’, because regular expressions are easier to write than to read/understand. And they are not particularly easy to write.” - Jenny Bryan



So why do regular expressions or ‘regex’ get so much flak if they are so powerful for text matching?

Scary example: how to get an email in different programming languages http://emailregex.com/.

Regex is definitely one of those times when it is important to annotate your code. There are many jokes related to people coming back to their code the next day and having no idea what their code means.

There are sites available to help you make up your regular expressions and validate them against text. These are usually not R specific, but they will get you close and the expression will only need a slight modification for R (like an extra backslash - described below).

Regex testers:

https://regex101.com/
https://regexr.com/

What I would like to get across is that it is okay to google and use resources early on for regex, and that even experts still use these resources.







What does the language look like?

The language is based on meta-characters which have a special meaning rather than their literal meaning. For example, ‘$’ is used to match the end of a string, and this use supersedes its use as a character in a string (i.e. ‘Joe paid $2.99 for chips.’).

Matching by position

Where is the character in the string?

Quantifiers

How many times will a character appear?

Classes

What kind of character is it?

Operators

Helper actions to match your characters.
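As a rough illustration of the four groups above (position, quantifiers, classes and operators), here is a minimal sketch using base R's grepl. The example strings and object names are made up for this illustration and are not part of the lesson data.

```r
# Hypothetical example strings (not from the lesson data)
x <- c("cat", "cart", "scatter", "Cat 22", "dog")

# Matching by position: ^ anchors to the start, $ to the end of the string
starts_with_c <- grepl("^c", x)
ends_with_t   <- grepl("t$", x)

# Quantifiers: {2} means exactly two of the preceding character
has_double_t <- grepl("t{2}", x)

# Classes: [[:digit:]] (or [0-9]) matches any single digit
has_digit <- grepl("[[:digit:]]", x)

# Operators: | means OR, [Cc] means any one of the enclosed characters
cat_or_dog   <- grepl("cat|dog", x)
any_case_cat <- grepl("[Cc]at", x)

data.frame(x, starts_with_c, ends_with_t, has_double_t,
           has_digit, cat_or_dog, any_case_cat)
```

Note that grepl looks for the pattern anywhere in the string unless it is anchored, which is why "scatter" counts as a match for "cat|dog".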

Escape characters

Sometimes a meta-character is just a character. Escaping allows you to use a character ‘as is’ rather than its special function. In R, a pattern is first parsed as an ordinary character string and only then interpreted as a regular expression. The backslash is the escape character at both stages, so you need 2 backslashes to escape, say, a ‘$’ sign ("\\$"): the string parser turns "\\$" into \$, which the regex engine then reads as a literal dollar sign.

Trouble-shooting with escaping meta-characters means adding backslashes until something works.
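To make the double backslash concrete, here is a small sketch using the chips sentence from earlier. The object names are only for illustration.

```r
price <- "Joe paid $2.99 for chips."

# Unescaped: $ is the end-of-string anchor, so the replacement is
# inserted at the very end and the dollar sign is left alone
unescaped <- sub("$", "USD ", price)

# Escaped with two backslashes: the literal dollar sign is matched
escaped <- sub("\\$", "USD ", price)

# fixed = TRUE turns regex off entirely, so "$" is taken literally
literal <- sub("$", "USD ", price, fixed = TRUE)

unescaped  # "Joe paid $2.99 for chips.USD "
escaped    # "Joe paid USD 2.99 for chips."
literal    # "Joe paid USD 2.99 for chips."
```

When the whole pattern is meant literally, fixed = TRUE is often easier to read than a pile of backslashes.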

Joking/Not Joking (xkcd)

While you can always refer back to this lesson for making your regular expressions, you can also use this regex cheatsheet.


Data Cleaning with Base R (AKA What is Elon Musk up to anyways?)

Let’s take this cacophony of characters we’ve just learned about and perform some basic data cleaning tasks with an actual messy data set. I have scraped Elon Musk’s latest tweets from Twitter. The code to do this is in the Lesson 4 file twitter_scrape.R if you are curious or want to creep someone on Twitter.

Let’s read in the set of tweets and take a look at the structure of the data.

elon_tweets_df <- read.delim("data/elon_tweets_df.txt", sep = "\t", stringsAsFactors = FALSE)
EOF within quoted string

The warning about EOF (end of file) within quoted string usually means the parser found a quotation mark that was never closed, which is easy to imagine in tweets full of quotes and other special characters (emojis, arrows, etc.). Let’s take a look at how the file was parsed.

str(elon_tweets_df)

Our end goal is going to be to look at the top 50 words in Elon Musk’s tweets and make a wordcloud. I don’t want urls, hashtags, or other tags. I also don’t want punctuation or spaces. I just want to extract the words from tweets. It might be fun to look at the top favorite tweets while we are data cleaning, so let’s use tidyverse functions to keep the text tweets and order them by the favorited counts.

elon_tweets_df <- elon_tweets_df %>% 
  select(text, favoriteCount) %>%
  arrange(desc(favoriteCount))

elon_tweets_df$text[1:5]

First, I want to remove the tags from the beginning of words. I am going to save my regular expression into an object so we can use it again later.

What this expression says is that I want to find matches for a hashtag symbol OR an asperand (‘at’ symbol) followed by at least one word character. Note that the | splits the whole pattern in two, so the ‘followed by at least one word character’ part applies only to the asperand. grep is a function that allows us to match our pattern (our expression) to a character vector. It is a good idea to do a visual inspection of your result to make sure your matches or substitutions are working the way you expected.

tags <- "#|@\\w+"

grep(pattern = tags, x = elon_tweets_df$text)

We can see that grep returns the index of the match. We have a number of entries that include tags. We also have a number of warnings that we will return to.

If we want to return the tweet itself instead of the index, we can use the argument value = TRUE. In this case, it looks like each tweet matched does have a tag. (You will get a warning here too; I have not printed it.)

grep(tags, elon_tweets_df$text, value = TRUE) %>% head()

We can then use gsub to replace that pattern (our tags) with nothing (an empty string).

elon_tweets_df$text <- gsub(pattern = tags, replacement = "", elon_tweets_df$text)
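Because | has the lowest precedence in a regular expression, it is worth checking on a toy string exactly what the tags pattern removes. The tweet below is made up, and whole_tags is a hypothetical alternative pattern, not the one used in this lesson.

```r
tweet <- "Thanks @NASA for the #Mars photos"

# The lesson's pattern: a lone "#" OR "@" followed by word characters
tags <- "#|@\\w+"
lesson_result <- gsub(tags, "", tweet)

# Hypothetical alternative: "#" or "@" together with the word that follows
whole_tags <- "[#@]\\w+"
alt_result <- gsub(whole_tags, "", tweet)

lesson_result  # "Thanks  for the Mars photos"
alt_result     # "Thanks  for the  photos"
```

With the lesson's pattern the mention disappears entirely, while the hashtag loses only its symbol and the word stays in the text. Which behaviour you want depends on whether hashtag words should count towards the word cloud.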

Back to the warnings about strings being ‘invalid in this locale’. Let’s take a look at these strings by subsetting for the indices given.

elon_tweets_df$text[c(10,118, 156, 219, 224)]

From context, it looks like these character strings have emojis in them, which have their own character codes. Why would this give us a warning? Tweets are encoded in UTF-16 and converted to UTF-8 when read into R. Things that have character codes get encoded differently. Here is an example of emoji encoding. Since we are going to remove anything with special character codes (i.e. a curly apostrophe or emoji), we are going to use the iconv function to substitute encoded character codes that need converting with nothing (again, an empty character string). This is not something you will have to deal with on a daily basis, but character encoding is something to be aware of, especially when scraping data from the web.

elon_tweets_df$text <- iconv(elon_tweets_df$text, "UTF-8", "ASCII", sub = "")
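A minimal sketch of what this iconv call does, using a made-up string with an accented letter and a rocket emoji written as Unicode escapes. The object names are only for illustration.

```r
# Hypothetical string: "café", a rocket emoji, then plain text
x <- "caf\u00e9 \U0001F680 launch"

# Anything that cannot be represented in ASCII is replaced by sub ("")
ascii_only <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")
ascii_only  # "caf  launch"

# sub = "byte" shows the offending bytes instead of dropping them,
# which is handy for inspecting what would be lost
iconv(x, from = "UTF-8", to = "ASCII", sub = "byte")
```

Notice that the accented letter is dropped rather than converted to a plain "e", so this approach can damage words in text that is not mostly English.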

elon_tweets_df$text[c(10,118, 156, 219, 224)]

Looking back at our problematic strings, you can see that the emojis have been removed as well as quotation marks. The hashtag and asperand symbols are plain ASCII characters, so iconv would not have touched them; that is why we removed them with gsub.

Our next step would be to remove urls. This is a bit tricky. We could be looking for http:// or https:// followed by we don’t know what (some combination of letters, numbers and forward slashes).

We can check out which tweets have urls using grep as we did previously to see if we managed to match urls.

We are going to continue our pattern of using gsub to substitute what we don’t want with an empty character string.

url <- "http[s]?://[[:alnum:].\\/]+"

grep(url, elon_tweets_df$text, value = TRUE) %>% head()

elon_tweets_df$text <- gsub(pattern = url, replacement = "", elon_tweets_df$text)

We can also use grepl to get a logical response for whether a tweet has a url or not. That way, if you wanted to grab all of the urls that Elon Musk suggests to visit, you can filter with grepl to select all of the tweets where it is TRUE that a url is present. (Run this before the gsub step above; once the urls have been removed there is nothing left to match.)

grepl(url, elon_tweets_df$text) %>% head()

elon_urls <- elon_tweets_df %>% filter(grepl(url, text))

Lastly, we are going to get rid of trailing spaces, numbers, and punctuation all at the same time. You can find trailing spaces at the very end of our tweet string from removing the urls.

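# trailing spaces OR a run of digits OR a single punctuation character
# ([0-9]+ rather than [0-9]*: the * version also matches the empty string, so grep would flag every tweet)
trail <- "[ ]+$|[0-9]+|[[:punct:]]"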

grep(trail, elon_tweets_df$text, value = TRUE) %>% head()

We can check to see that we are picking up strings with punctuation, numbers and trailing spaces, and then we can remove them and compare our output.

elon_tweets_df$text <- gsub(pattern = trail, replacement = "", elon_tweets_df$text)

elon_tweets_df$text[1:5]

It looks like everything worked except there are extra spaces from whenever a number was removed. Let’s take all of the places where there are 2 or more spaces created and substitute them with just one space.

space <- "\\s{2,}"

grep(space, elon_tweets_df$text, value = TRUE) %>% head()

Again, we can check to see that we are picking up strings with extra spaces, and then replace those spaces with a single space.

elon_tweets_df$text <- gsub(pattern = space, replacement = " ", elon_tweets_df$text)

elon_tweets_df$text[1:5]

It worked!


Challenge

We also have a leading whitespace where we removed a number. How would we remove that whitespace? Can you think of more than one way to do this?






Onwards!! Let’s break the tweets down into individual words, so we can see what the most common words used are. We can use the base R function strsplit to do this; in this case we want to split our tweets into words using spaces.

strsplit(elon_tweets_df$text, split = " ") %>% head()

Note that the output of this function is some horrible nested list object.

Luckily there is an unlist function which recursively will go through lists to simplify their elements into a vector. Let’s try it and check the structure of our output. We will save this to an object called ‘words’.

unlist(strsplit(elon_tweets_df$text, split = " ")) %>% head(20)

words <- unlist(strsplit(elon_tweets_df$text, split = " "))

Our output is now a long character vector. This will make it much easier to count words.

str(words)

Let’s take a peek at the words.

tail(words)

Great! But… we missed some \n (newline) and \t (tab) characters. These are not punctuation characters.


Challenge

Newline and tab characters are separating 2 words. Split these words apart and get rid of the newline character. Convert all of our character strings to lowercase (I haven’t shown you how to do this, but I believe in your google-fu). Check the first and last 50 words to see if anything else is amiss.





There are still a few problems with words cut off like ‘solv’, or ‘flamethrower’ and ‘flamethrowers’ being the same word, or ‘north’ and ‘korea’ belonging together for context. If we were serious about this dataset we would need to resolve these issues. We also have some html and twitter-specific tags that we will deal with shortly.

Let’s move ahead and count the number of occurrences of each word and order them by frequency. We do this using our dplyr functions (Lesson 2).

data.frame(words) %>% count(factor(words)) %>% arrange(desc(n))

Wow. We have discovered people use prepositions and conjunctions. There are also words unrelated to content but that are html jargon, or things like ‘na’ and ‘false’.

Luckily text mining is an area of data analytics in full force and there is a list of ‘stop words’ that can be used to get rid of words that are unlikely to contain useful information as part of the tidytext package. However, we will have to add to this list.

The data that comes with the package is called stop_words. We can save it as an object and take a look at its structure.

stop_words <- stop_words
str(stop_words)

We can then add rows to this data frame with our own stop words. Remember that to bind data frames together with bind_rows, the column names have to match. We can make a small data frame and call our lexicon ‘custom’. Note that I have written ‘custom’ once - it will recycle as a character vector of length 1 to the length of the data frame.

add_stop <- data.frame(word = c("na", "false", "href", "rel", "nofollow", "true", "amp", "twitter", "iphonea", "relnofollowtwitter", "relnofollowinstagrama"), 
                       lexicon = "custom", stringsAsFactors = FALSE)

stop_words <- bind_rows(stop_words, add_stop)

To remove these stop words from our list of words from tweets, we perform an anti-join (from Lesson 3).

words <- anti_join(data.frame(words), stop_words, by=c("words" = "word"))

Let’s look at our top words by count now, and save this order.

words %>% count(words) %>% arrange(desc(n))

words <- words %>% count(words) %>% arrange(desc(n))
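To see the anti-join and the count on something small enough to check by eye, here is a toy sketch. The words_df and stops data frames are made up, and dplyr is assumed to be loaded (it comes with tidyverse).

```r
library(dplyr)

# Hypothetical word list and a tiny custom stop word list
words_df <- data.frame(words = c("the", "rocket", "is", "boring", "rocket"),
                       stringsAsFactors = FALSE)
stops <- data.frame(word = c("the", "is"), lexicon = "custom",
                    stringsAsFactors = FALSE)

# Keep only the rows of words_df with no match in stops;
# by = c("words" = "word") pairs up the differently named columns
kept <- anti_join(words_df, stops, by = c("words" = "word"))

# Tally what is left and sort by descending count
counts <- kept %>% count(words) %>% arrange(desc(n))
counts
```

Here ‘rocket’ ends up first with a count of 2 and ‘boring’ second with a count of 1, while ‘the’ and ‘is’ are gone.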

‘boring’, ‘falcon’, ‘tesla’, ‘rocket’, ‘launch’, ‘flamethrower’, ‘cars’, ‘spacex’ and ‘tunnels’ are near the top, and ‘mars’ and ‘ai’ are a bit further down the list. There are a few words that look like they should be added to the ‘stop words’ list (dont, doesnt, didnt, im), but we’ll work with this for now.

We can make a word cloud out of the top 50 words, which will be sized according to their frequency. I am starting with the first word after Elon Musk’s twitter handle. The default color is black, but we can use our viridis package (Lesson 3) to have a pleasing color palette. It is okay if this code gives you a warning that not all words can be fit on the page, this can be changed by adjusting the scale argument.
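A minimal sketch of such a word cloud call, assuming the wordcloud and viridis packages loaded at the top of the lesson. The word_counts data frame and its numbers are invented stand-ins for the sorted word counts, so the picture will not match the real tweets.

```r
library(wordcloud)
library(viridis)

# Hypothetical counts standing in for the sorted word count table
word_counts <- data.frame(words = c("tesla", "rocket", "falcon", "mars", "launch"),
                          n = c(30, 22, 18, 9, 7),
                          stringsAsFactors = FALSE)

# Keep at most the top 50 rows; to skip a first row such as an
# account handle, subset with word_counts[-1, ] before calling head()
top_words <- head(word_counts, 50)

set.seed(42)  # word placement is random, so fix the seed for repeatability
wordcloud(words = top_words$words,
          freq = top_words$n,
          min.freq = 1,           # the default of 3 would drop rare words
          max.words = 50,
          random.order = FALSE,   # most frequent words in the centre
          colors = viridis(5),
          scale = c(3, 0.5))      # largest and smallest font size
```

Lowering the first value of scale is the usual way to make the ‘could not be fit on page’ warning go away.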


Data Cleaning with stringr/stringi (AKA What is Trump up to anyways?)

We are going to go through the same data cleaning process with the stringr package using Trump’s tweets. The syntax is a little different, but it is pretty intuitive once you get started. All stringr functions can be found using str_ + Tab. Again, we will start by loading the dataset and looking at the top 5 favorite tweets. We will remove all encoded character codes right away.

trump_tweets_df <- read.delim("data/trump_tweets_df.txt", sep = "\t", stringsAsFactors = FALSE)
trump_tweets_df$text <- iconv(trump_tweets_df$text, "UTF-8", "ASCII", sub = "")

trump_tweets_df <- trump_tweets_df %>% select(text, favoriteCount) %>% arrange(desc(favoriteCount)) 
trump_tweets_df$text[1:5]

The first thing that we did was look for tags. The order of arguments is switched in stringr relative to the base functions. The first argument will be the character string we are searching, and the second argument will be the pattern we are matching. str_extract returns the first match found in each string, or NA where there is no match (the bracketed numbers in the printed output are just vector positions). This is similar to grep when value = TRUE, except that the match is extracted rather than the entire string.

str_extract(string = trump_tweets_df$text, pattern = tags) %>% head(100)

str_detect is similar to grepl returning TRUE or FALSE if a match is or isn’t found, respectively.

str_detect(trump_tweets_df$text, tags)
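A small sketch of how these stringr functions behave on made-up tweets, using the same tags pattern as before. The toy_tweets vector is hypothetical.

```r
library(stringr)

toy_tweets <- c("Thank you @FLOTUS!", "No tags here", "#MAGA and @VP")
tags <- "#|@\\w+"

# First match per string, NA when nothing matches
first_match <- str_extract(toy_tweets, tags)
first_match  # "@FLOTUS" NA "#"

# Every match per string, returned as a list
all_matches <- str_extract_all(toy_tweets, tags)
all_matches[[3]]  # "#" "@VP"

# TRUE/FALSE per string, like grepl
has_tag <- str_detect(toy_tweets, tags)
has_tag  # TRUE FALSE TRUE
```

Because the output of str_extract has the same length as the input, it can be stored directly as a new column of a data frame.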

Let’s remove our urls as before. With the str_replace_all function we can specify our pattern and replacement, in this case an empty character string (str_replace would only replace the first match in each tweet). We can see in the result that the urls have been removed.

str_replace_all(trump_tweets_df$text[1:10], pattern = url, replacement = "")
trump_tweets_df$text <- str_replace_all(trump_tweets_df$text, pattern = url, replacement = "")

Let’s be ambitious and try to remove tags, numbers and punctuation characters all in one go. str_remove_all automatically replaces every match with an empty character string. It turns out the @ and # are punctuation characters, so removing them is taken care of using [[:punct:]]. We also want to remove the metacharacter $ (which stringr does not consider punctuation). We aren’t sure what order the numbers and punctuation might come in, and square brackets allow ANY one of the characters inside the brackets to be matched. We are not sure if there will be zero, one, or many of our target characters in a tweet, however str_remove_all() will remove every instance of this pattern (so we do not need a quantifier such as * outside the brackets). Looking at the output, we can see that the numbers and punctuation and dollar signs are indeed removed.

clean_all <- "[[0-9][[:punct:]]\\$]"

trump_tweets_df$text <- str_remove_all(trump_tweets_df$text, pattern = clean_all)

trump_tweets_df$text[1:10]
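The dollar sign needs its own entry because stringr (via the ICU regex engine) and base R do not agree on what [[:punct:]] covers. The sketch below compares the two on a made-up string; the object names are only for illustration.

```r
library(stringr)

x <- "Cost: $5 & 10% off!"

# Base R: the dollar sign is treated as punctuation and removed
base_clean <- gsub("[[:punct:]]", "", x)

# stringr: symbols such as $ are not in [[:punct:]], so $ survives
icu_clean <- str_remove_all(x, "[[:punct:]]")

# Adding an escaped $ to the bracket set removes it as well
icu_clean_all <- str_remove_all(x, "[[:punct:]\\$]")

base_clean
icu_clean
icu_clean_all
```

Other symbols such as < and > are treated the same way as $ by stringr, so it is worth printing a few cleaned strings before trusting a pattern carried over from base R.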

As expected, we still have trailing spaces. Whitespace characters are not visible, but take up space. Newline characters, tabs and spaces are a form of whitespace. stringr has its own function for trimming whitespace, str_trim, which you can use to specify whether you want leading or trailing whitespace trimmed, or both.

trump_tweets_df$text <- str_trim(trump_tweets_df$text, side = "both")

trump_tweets_df$text[1:10]

See how we have a couple extra spaces in the middle of some of our strings? str_squish will take care of that for us, leaving only a single space between words.

trump_tweets_df$text <- str_squish(trump_tweets_df$text)

trump_tweets_df$text[1:10]
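A toy comparison of str_trim and str_squish, on a made-up string with leading, trailing and repeated internal whitespace (including a newline):

```r
library(stringr)

x <- "  make   america \n great  "

# Only the ends are trimmed; internal whitespace is untouched
trimmed_left <- str_trim(x, side = "left")
trimmed_both <- str_trim(x, side = "both")

# Ends are trimmed AND every internal run of whitespace
# (spaces, tabs, newlines) collapses to a single space
squished <- str_squish(x)

trimmed_both  # "make   america \n great"
squished      # "make america great"
```

Since str_squish also trims both ends, calling str_trim first is harmless but not strictly needed.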

All that’s left is to convert all characters to lowercase, and then we can see the top Trump words!

trump_tweets_df$text <- tolower(trump_tweets_df$text)

trump_tweets_df$text[1:10]

To get our tweets into a word list we use str_split, a similar function to strsplit, still splitting by the spaces between words. The argument simplify = FALSE returns a list of character vectors which we then unlist.
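A minimal sketch of the splitting step on two made-up tweets. The toy_ names are hypothetical; on the real data the same calls would presumably be applied to trump_tweets_df$text and the result saved as words.

```r
library(stringr)

toy_tweets <- c("crazy joe biden", "fake news")

# simplify = FALSE (the default) gives a list with one character vector per tweet
toy_list <- str_split(toy_tweets, pattern = " ", simplify = FALSE)

# unlist flattens that list into one long character vector
toy_words <- unlist(toy_list)
toy_words  # "crazy" "joe" "biden" "fake" "news"

# simplify = TRUE gives a character matrix instead, padded with ""
toy_matrix <- str_split(toy_tweets, pattern = " ", simplify = TRUE)
dim(toy_matrix)  # 2 3
```

The list form is the better choice here because tweets have different numbers of words, and the padding in the matrix form would show up as empty ‘words’ in the counts.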

str(words)
 chr [1:8766] "crazy" "joe" "biden" "is" "trying" "to" "act" "like" "a" "tough" "guy" "actually" "he" "is" ...

We can now do our anti_join to remove ‘stop words’, and tally our remaining words and order them by descending counts.

words <- anti_join(data.frame(words), stop_words, by=c("words" = "word"))

words %>% count(words) %>% arrange(desc(n)) 

words <- words %>% count(words) %>% arrange(desc(n))

Hmmm… it looks like we have those html tags in a different format. It’s interesting to note these little variations because no matter how much you try to automate your analysis there is always going to be something from your new dataset that didn’t fit with your old dataset. This is why we need these data wrangling skills. Even though some packages may have been created to help us on our way, they can’t possibly cover every case.



















We could go back and get rid of some of the characters such as <, however we don’t want to lose sight that these are html tags and not words (the tweet was from an ipad or iphone, the ‘word’ isn’t being mentioned). We will instead add these to our stop words list.

add_stop <- data.frame(word = c("rel=nofollow>twitter", "href=", "iphone<a>", "<a","dont", "$", "href=downloadipad", "ipad<a>" ), 
                       lexicon = "custom", stringsAsFactors = FALSE)


stop_words <- bind_rows(stop_words, add_stop)

We then perform an anti_join with our new list and view the updated version. (words was already sorted and so we do not need to do that again.)

‘president’, ‘people’, ‘fake’, ‘news’, ‘daca’, ‘democrats’, ‘jobs’, ‘obama’, ‘border’, ‘fbi’, ‘collusion’, ‘russia’, ‘wall’, ‘mexico’ and further down is ‘crooked’ and ‘hillary’.

Trump’s wordcloud minus his twitter handle.


Challenge

Pick one of the other tweet data sets:

Bill Nye, Justin Trudeau, The Daily Show, Katy Perry, Jimmy Fallon, Stephen Colbert.

Clean it. Remove all of the stop words. Were there any other challenges compared to the previous datasets? Did you have to create new stop words or do extra regex? Make a wordcloud of the top 50 words.



Rmarkdown and knitr

Markdown is a plain text formatting syntax. It allows one to easily add headings, lists, links, highlighting, bullets, images, equations, tables and text styling. R has a modified version of markdown (R markdown) where you can embed code chunks into a document. Combined with the knitr package, this allows us to make reproducible documents. The awesomeness of this combination allows us to annotate our code while we work in a format that is immediately presentable. With the click of a button our script (or notebook) can be converted into a word document, a pdf, or a shareable html hyperlink. Other formats such as slides are also possible. This means that you only have to do your work once - you don’t have to have your code, generate images, paste them into powerpoint or word - get asked to change something, rerun the code, get the figure, change your powerpoint… everything is all in one place - you make your change, knit your document and you are done - you don’t have to leave RStudio.

Markdown and knitr can also save us time scrolling through our code - a table of contents can be added, code chunks can be named - both of which allow us to jump around and navigate our script easily. It takes a bit of discipline to get started, but I believe that you will see the benefits pretty quickly. There are even markdown templates for submitting to different journals or writing a thesis.

For this lesson I suggest going to Tools -> Global Options -> R Markdown -> Show output preview in: and change from ‘Window’ to ‘Viewer Pane’, then click ‘Apply’ and ‘OK’. This will allow us to see our new document in the same window (in the Viewer) instead of switching back and forth between our code and the document in a separate window.

R markdown syntax

Let’s start by creating a new R markdown document by going to File -> New File -> R Markdown. R will ask you for the Title of your document, the Author and whether you want to render your markdown as html, pdf, or as a word document. There is also the option to make slides (under Presentation), a Shiny app, or use Templates for package documentation or GitHub (there is a git version of markdown that differs slightly from R markdown). html renders faster than the other formats, so we will stick with that and click OK. R immediately puts a yaml (yet another markup language, yaml ain’t markup language) header which tells you how the file is configured.

---
title: "R markdown Lesson"
author: "EA"
date: "April 11, 2018"
output: html_document
---

As you can see, the date has been added (this is the date of script creation and does not update when the script is rendered), and the output is going to be html. Note also that your document is Untitled. We can go ahead and save that as ‘Rmarkdown_Lesson’ and see that the file is saved as a .Rmd file.

We can see that R has a little demo set up already which we are going to work with. The new element in .Rmd files is the code chunk, denoted by a set of 3 backticks followed by {r}, some code, and a closing set of 3 backticks. The keyboard shortcut for generating a code chunk is CTRL + ALT + I.

```{r name_of_chunk, code_options}

type code here

```

These code chunks have been named ‘setup’, ‘cars’, and ‘pressure’, and at the bottom left of the source pane (or if you go to Code -> Jump To... it will pop up) you can navigate between these code chunks. This is a helpful feature as your code gets a bit longer.

You have probably also noticed that in this navigation bar the bolded items correspond to the title of the document and the text with a leading ##. Hashtags (when outside code chunks, i.e. in markdown language) denote headers. The number of hashtags denotes the level of the header as well as the size. For example, the title is a first level header, and ‘R Markdown’ and ‘Including Plots’ are second level headers, and will also be smaller than the title. Let’s go ahead and knit our document by clicking the Knit button. Note that you can change from your default output choice (html) to Word or pdf in the dropdown menu.

Let’s look at how markdown is rendered in our html file. We can see that to bold text you can have two asterisks (**) or two underscores (__) on either side of the text.

You can insert a url by typing the url inside of angle brackets <http://rmarkdown.rstudio.com>. If you want the link to be named ‘rmarkdown’ you can format it like this [rmarkdown] (http://rmarkdown.rstudio.com) without the space in between the name and the url. Replace the url with the named version and click knit to see the difference in the output.

The other emphasized text in this document has a grey background. This is achieved by flanking the text with `backticks`. The code in the ‘cars’ chunk has a grey background and the evaluated output is in white. This is standard in the R community for the code to be grey and the output to be white and commented. Inline code can be written as `r 2+2` (the letter r, a space and the expression, all inside one pair of backticks); knitr evaluates it and inserts the result into the sentence.

Rmarkdown

To make a bulleted list:

* you need to leave a line before 
* the text and the start of your bullets and a 
* space between the asterisk (bullet) and your text.

Rendered

To make a bulleted list:

  • you need to leave a line before
  • the text and the start of your bullets and a
  • space between the asterisk (bullet) and your text.

Rmarkdown + Rendered

To make a numbered list:

  1. you need to leave a line before
  2. the text and the start of your numbers and a
  3. space between the numbers and your text.

Rmarkdown

To make a super-cool updatable numbered list:

1. you need to do the above as with numbered lists
1. but all of the numbers are numbered '1.' 
1. you can now add, remove and reorder and your numbers will update.

Rendered

To make a super-cool updatable numbered list:

  1. you need to do the above for numbered lists
  2. but all of the numbers are numbered ‘1.’
  3. you can now add, remove and reorder and your numbers will update.

Rmarkdown

If ever your text
is clumping together
when you do not expect it to,
remember that you need at least 2 spaces at the end of a line
to start a new line.

Rendered

If ever your text is clumping together when you do not expect it to, remember that you need at least 2 spaces at the end of a line to start a new line.

Rmarkdown

    A text box can be created by indenting with Tab twice.

Rendered

A text box can be created by indenting with Tab twice.

Rmarkdown

A line across the page is ’***’.

Rendered


Knitr Chunk Options

You might notice that while there are 3 code chunks in this example, there is one line of code visible in the rendered version, and 2 outputs (summary statistics and a plot). Why don’t we see the code used to make the plot? Code chunks have options that can be entered to modify their output. In this case the inclusion of echo = FALSE prevents the code from being included, but the code is still run and so the plot is still produced. Try changing the code chunk option for the plot to eval = FALSE. What happened?

{r pressure, eval = FALSE}

plot(pressure)

eval = FALSE means the code is shown, but not evaluated.

You can specify which lines of code in a chunk get evaluated. For example if you had 5 lines of code, but only wanted to run the first and third, you could use eval = c(1,3). This feature of using a vector of positions to specify code is available for other chunk options such as echo.

The first code chunk in this script is setting default options for all code chunks to be used in this script. In this case echo = TRUE was set as a default chunk option. The option include = FALSE for this chunk means that the code will not be included, but the code will still be run. The difference between this command and echo = FALSE is that the output of the code is NOT shown.

Let’s look at what the default chunk options are - this way we will be able to see all of the options available to change.

{r setup, echo = TRUE }

str(knitr::opts_chunk$get())
List of 1
 $ error: logi FALSE

There are a ton of options here. You can guess that some of them have to do with default figure sizes and labels, and there are a bunch of options that are not specified (NULL).

As far as setting chunk options at the beginning of a script goes, consider the following:

{r}

library(tidyverse)

If we are creating a document, we may want to show what package we used, but we don’t want all of the package startup messages. With message = FALSE the code will run and be shown but any messages generated will be suppressed.

{r message = FALSE}

library(tidyverse)

I could use include = FALSE, message = FALSE if I wanted the library loaded but didn’t want the code or its message to be seen.
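The setup chunk described earlier sets document-wide defaults with knitr::opts_chunk$set. As a small sketch, assuming you wanted code shown but package messages hidden in every chunk, the setup chunk could contain something like this (the particular options chosen here are just an example):

```r
library(knitr)

# Defaults applied to every chunk that follows in the document;
# individual chunks can still override them in their own header
opts_chunk$set(echo = TRUE, message = FALSE)

# Check what the defaults are now
opts_chunk$get("echo")     # TRUE
opts_chunk$get("message")  # FALSE
```

Setting message = FALSE once here saves repeating it in the header of every chunk that loads a package.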

If I wanted to do something silly like add 6 to every summary value (in truth each of these summary values is a character) it would generate an error. A document will not be rendered if it has an error in it. Try to knit the document with this code.

{r error = TRUE}

summary(cars) + 6

Adding the option error = TRUE allows the document to be rendered despite the error. The error message will still be shown.


Challenge

For the code chunk containing plot(pressure): How would you show just the code (not the plot) and not run the code? How would you show just the code and have the output run but not show it? How would you change the background color to something other than gray? You can use help pages, Google, or the knitr documentation found here: https://yihui.name/knitr/options/#chunk_options




Caching

In the ‘Run’ dropdown menu (found in the top right of the source pane), there are various options for running your current chunk - CTRL+SHIFT+ENTER, the next chunk - CTRL+ALT+N, all chunks above - CTRL+ALT+P, all chunks below, and another few options. This allows you to assess the upstream and downstream consequences of a change in your code. knitr also has the option to cache the output of code chunks by setting the option cache = TRUE. A folder will be created that saves the output of your chunk in a data file. knitr accesses the cache and loads the result from the last time the chunk was run without recalculating values. This can be very useful if the code in a particular chunk takes a while to run and you are assessing changes unrelated to that code, or changes after that code.

For example, if your document isn’t knitting because of an error at line 200 and your time intensive code runs at line 100, you can cache the line 100 chunk and troubleshoot the line 200 code without having to wait for this earlier chunk to run again. The caching caveat is that a change in earlier code (at line 50 in this example) that your cached chunk depends on does not update the cache (ie. the result of the code at line 100 would still not change). Therefore it is important to be conscious of what you are caching and where changes are occurring in your script. You should uncache your code chunk for the final rendering to make sure there haven’t been any unforeseen changes to your document.

This is a simplified explanation of caching and more details can be found in the knitr manual and its cache demo.

Playing with Caching


Scenario 1

With our current .Rmd file, let’s say the ‘summary’ chunk took a while to run. Let’s add cache = TRUE to its chunk options. Make sure eval = FALSE has been removed from the options in your ‘plot’ chunk. Knit the document and take a look.

We are now going to change the plot chunk to depend on the cars dataset.

{r cars}

plot(cars$speed, cars$dist)

We can knit the document again, and assume that we saved ourselves the time cost of running the second chunk when we are just updating a plot.

Now let’s put a chunk before our cached chunk called ‘new row’. Add a point to the cars dataset using dplyr’s bind_rows. Note that I haven’t loaded the entire dplyr package, but rather have just made a call to one specific function. Knit the document again.

{r new row}

cars <- dplyr::bind_rows(cars, c(speed = 50, dist = 200))

This point is an outlier, and the change can be seen on the output of our plot. However, our summary also depends on the cars dataset and has not been updated (ie. the maximum distance is still 120 km). If the code of the cached chunk does not change, the chunk is not rerun.
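
The idea that a cached result is reused for as long as the code text stays the same can be sketched with a toy cache in base R. This is only an illustration of the behaviour described above, not how knitr is implemented; run_chunk, toy_cache and work are names invented for this example.

```r
# Toy cache keyed on the text of a "chunk" (hypothetical helper, not knitr's code)
toy_cache <- new.env()

run_chunk <- function(code, env) {
  if (exists(code, envir = toy_cache, inherits = FALSE)) {
    # Code text unchanged: reuse the stored result without running anything
    return(get(code, envir = toy_cache))
  }
  # Code text is new or edited: run it and store the result
  result <- eval(parse(text = code), envir = env)
  assign(code, result, envir = toy_cache)
  result
}

work <- new.env()
work$cars <- datasets::cars

# First run: computed from the original data
first <- run_chunk("max(cars$dist)", work)

# An "upstream" change that the cached chunk cannot see
work$cars <- rbind(work$cars, data.frame(speed = 50, dist = 200))

# Same code text, so the stale result is reused
second <- run_chunk("max(cars$dist)", work)

# Edited code text, so the value is recomputed from the current data
third <- run_chunk("max(cars$dist, na.rm = TRUE)", work)

c(first = first, second = second, third = third)
```

The second call still reports the old maximum even though the data changed, which is the same trap as in this scenario.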

Scenario 2

If I change the cached chunk, say, by adding another outlier data point - what do you think will happen? Add this point and knit again.

{r cars, message = FALSE, cache = TRUE}

cars <- dplyr::bind_rows(cars, c(speed = 100, dist = 300))
summary(cars)

Since the code in the cache changed, its values were recalculated and the max distance in the summary has changed to 300 km. Changes in the cached chunk are evaluated and passed on to the next code block. The previous cache values are deleted and replaced with the current values. The plot, which depends on the cars dataset, now has a point at dist = 300 km AND dist = 200 km.

Scenario 3

If I change cars (with the outlier point) in the ‘new row’ chunk back to cars (without the outlier point) and knit the document, what do you expect to happen?

{r new row}

cars <- cars

Since the code for our cached chunk doesn’t change, its values are loaded from the cache. This includes the cars dataset, and so the plot, downstream of our cache, does not reflect the change to the ‘new row’ chunk.

Scenario 4

What if we instead comment out the outlier point in the cached chunk and keep the extra point from the first chunk?

{r new row}

cars <- cars
cars <- dplyr::bind_rows(cars, c(speed = 50, dist = 200))

{r cars, message = FALSE, cache = TRUE}

#cars <- dplyr::bind_rows(cars, c(speed = 100, dist = 300))
summary(cars)

The code in the cache changed and so the summary was recalculated using the earlier chunk. Both the summary and the plot show the 200 km distance value.

Scenario 5

What do you expect to happen if both outlier points are commented out and we restart the R session and knit the document? What is the max dist value in the summary table? What about the plot - does it contain the outlier?

{r new row}

cars <- cars
#cars <- dplyr::bind_rows(cars, c(speed = 50, dist = 200))

{r cars, message = FALSE, cache = TRUE}

#cars <- dplyr::bind_rows(cars, c(speed = 100, dist = 300))
summary(cars)

Since the code in the cached chunk did not change, knitr accesses the summary from the cached data, which has 200 km as the maximum value. However, the plot needs to be created from scratch and uses cars from the first chunk so has a max of 120 km.

Scenario 6

What if we remove cache=TRUE from our code chunk?

{r cars, message = FALSE}

#cars <- dplyr::bind_rows(cars, c(speed = 100, dist = 300))
summary(cars)

The cache is no longer accessed. The summary table and the plot reflect the original cars dataset with a max of 120 km.

Scenario 7

What if we add cache=TRUE to our code chunk again?

{r cars, message = FALSE, cache = TRUE}

#cars <- dplyr::bind_rows(cars, c(speed = 100, dist = 300))
summary(cars)

The cache files are kept on the computer. The summary table reflects the previously cached value of 200 km, the plot is calculated from the first chunk and has a max of 120 km. The cache will stay this value until it is updated. To start with a fresh cache in the same directory you need to delete your cached files.
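
Deleting the cached files can be done from your file browser or from R. The sketch below uses a stand-in folder inside a temporary directory so that it is safe to run; the folder name Lesson_4_cache and the file name are assumptions for illustration, so check your own project directory for the actual cache folder before deleting anything.

```r
# Hypothetical cache folder name; look in your project directory for the real one
cache_dir <- file.path(tempdir(), "Lesson_4_cache")

# Stand-in for a cache that was created on an earlier knit
dir.create(cache_dir, showWarnings = FALSE)
file.create(file.path(cache_dir, "cars_abc123.RData"))

# Remove the folder and everything in it, then confirm it is gone
unlink(cache_dir, recursive = TRUE)
dir.exists(cache_dir)
```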

Tables

You can make tables in markdown, but it is kind of annoying compared to using the kable function from knitr. Making tables in markdown involves using a series of pipes (|) to separate columns and hyphens (-) to separate the column headers from the rows. Here is a guide on how to format markdown tables: https://help.github.com/articles/organizing-information-with-tables/. Go nuts. Here is an example of a markdown table.

  |Summary      | Values|
  |-------------|-------|
  | correlation |   0.8068949|
  | mean km/h   |    15.4|
  | mean km     |    42.98|
 
| Summary     | Values    |
|-------------|-----------|
| correlation | 0.8068949 |
| mean km/h   | 15.4      |
| mean km     | 42.98     |

Rounding Values

The value for ‘correlation’ in this table was really the output of ‘r cor(cars$speed, cars$dist)’, which gives the value 0.8068949. Rounding values is as easy as selecting the number of decimal places using the round function: ‘r round(cor(cars$speed, cars$dist), 2)’ would then give 0.81 (two decimal places). The kable tables we are going to work with have a digits argument which gets passed to the round function (usage: digits = 2).
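
Here is a short sketch of rounding in the console, using only base R. It also compares round with signif, since the two are easy to mix up; the number 123.456 is just a made-up value to show where they differ.

```r
r <- cor(cars$speed, cars$dist)

round(r, 2)    # rounds to 2 decimal places
signif(r, 2)   # rounds to 2 significant digits

# The two functions part ways for larger numbers
round(123.456, 2)
signif(123.456, 2)
```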

Kable tables in knitr

In this lesson we are going to focus on nice looking kable tables, which are easily customizable through the kableExtra package. The summary stats from cars is a table. However, it doesn’t look very good. Here is a reminder of the default output.

summary(cars)

This summary is an odd table object. If we turn it into a data frame it will be easier to work with.

dat <- data.frame(speed = summary(cars)[,1], distance = summary(cars)[,2])
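
As an aside, the values in dat are still text. If you wanted numbers you could calculate with, one possible alternative (not the approach used in the rest of this lesson) is to apply summary to each column; num_summary is a name made up for this sketch.

```r
# summary() on a data frame returns formatted text, not numbers
is.character(summary(cars))

# Applying summary() column by column keeps the values numeric
num_summary <- as.data.frame(sapply(cars, summary))
num_summary

# Arithmetic now works on the summary values
num_summary["Max.", "dist"] + 6
```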

Here, a simple call to kable creates a table styled similar to the above markdown.

kable(dat)

A variety of styles are offered with simple syntax. Here we have striped rows which highlight when you hover over them, the table width is the length of the longest text and not across the whole page, and the table is left-aligned.

kable(dat, "html")  %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE, position = "left")

You can also have the table move to the left or right side of your document so that text or a figure could be included beside it. In this case, having the table float right allows for text or images to be formatted on the left side of the page. For example, we could change the figure size as well and have a figure and table side-by-side in our document.


kable(dat, "html", escape = F)  %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE, position = "float_right") 

In this case, I shrank the plot using out.width and out.height so that it would fit beside our table.

{r pressure, echo = FALSE, fig.width=6, fig.height = 5, out.width='50%', out.height='50%'}

par(mar = c(4,4,1,0)) #adjusting figure margins
plot(cars$speed, cars$dist)



Customize with highlighting and borders.

For this table black lines have been specified as column borders. A row was specified to be highlighted by a yellow background as well as to have the text emphasized in bold. By default (escape = TRUE) kable escapes special characters. This interferes with column titles that contain html, so escape = FALSE is used for tables where the html in the titles needs to be rendered.

kable(dat, "html") %>%
       kable_styling("striped", full_width = FALSE) %>%
       column_spec(1:2, border_right = TRUE, border_left = TRUE) %>%
       row_spec(3, bold = T, color = "black", background = "yellow") 

Add footnotes.

Footnotes can be added to a table using symbols or alphabet markers for flags.

This is a good time to learn more useful data cleaning functions paste and paste0. These are made to join or ‘paste’ string characters together. In this case we want to take a character string (the title of each column of our data frame) and add a footnote symbol to it to denote units. Can you tell what the difference is between the 2 functions by the output?

colnames(dat)[1] <- paste("car_", colnames(dat)[1], footnote_marker_symbol(1))
colnames(dat)[2] <- paste0("car_", colnames(dat)[2], footnote_marker_alphabet(1))
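
If you want to check your answer, here is a minimal sketch with plain strings, where "*" is only a stand-in for the footnote marker.

```r
with_space <- paste("car_", "speed", "*")            # default separator is a single space
no_space   <- paste0("car_", "speed", "*")           # no separator
same_as_0  <- paste("car_", "speed", "*", sep = "")  # paste with an empty separator

with_space
no_space
identical(no_space, same_as_0)
```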

Escape has been changed to FALSE so that the html encoding of our superscript is not escaped. The legend for the footnote symbol or character below the table is also added in our kable call.

kable(dat, "html", escape = FALSE) %>%
  kable_styling("striped", full_width = F) %>%
  column_spec(1:2, border_right = TRUE, border_left = TRUE) %>%
  row_spec(3, bold = T, color = "black", background = "yellow") %>%
  footnote(symbol = "kilometers per hour", alphabet = "kilometers;") 

Adding Images to your Document

To add pictures to your document:

! [#caption (optional)]   (#directory/file)     {#size (optional)}     

![knitr - get it?](img/kitten-with-string.jpg){width=400px}

knitr - get it?


Minimum syntax to add an image (no caption, default image size):
![](img/kitten-with-string.jpg)

Table of contents

This is the yaml header including the table of contents (toc) for the lesson. It is as simple as writing toc: TRUE under the output for the document type you are using and then specifying with toc_depth what level of headers (remember our hashtags) you would like to include in the toc. I am keeping 1st, 2nd, and 3rd level headers in this example. If I had a 4th level header, it would still be larger than my text, but it will not show up in my table of contents. The toc creates a hyperlink to each section for the user to navigate the document. CTRL+SHIFT+O opens the document outline which allows navigation to these sections while coding.

---
title: "Lesson 4 - Of Data Cleaning and Documentation - Conquer Regular Expressions, Use R markdown and knitr to make PDFs, and Challenge yourself with a 'Real' Dataset"
output: 
  html_document:
          keep_md: yes
          toc: TRUE
          toc_depth: 3
  html_notebook:
          toc: TRUE
          toc_depth: 3
---

You may have noticed the blue button that kind of looks like an eyeball in the top right corner of the Viewer Pane as well as the Source Pane with a dropdown that says ‘Publish’. If you are super-proud of your work, you can post your rendered document for free, for the world to see at Rpubs. It can be interesting to see what other people in the R community have been working on as well.

Slides

Slideshows can also be made fairly simply in R markdown. Go to File -> New File -> R Presentation and create an .Rpres file. Slides are separated by a line of equals signs (===) and the title of the slide is just above this line.

  First Slide
  ========================================================

  For more details on authoring R presentations please visit <https://support.rstudio.com/hc/en-us/articles/200486468>.

  - Bullet 1
  - Bullet 2
  - Bullet 3

Slide With Code
========================================================
```r
summary(cars)
```
Slide With Plot
========================================================
```r
plot(cars)
```

If you click on ‘Preview’ in the Source Pane, a Presentation Tab will open in the Environment Pane with a slideshow that you can toggle through. In that Pane under ‘More’ you can also ‘View in Browser’ or ‘Save As Webpage’, which is the common way these slides get presented.

I really just wanted to show you that these slides exist. Depending on what you are presenting, this could be a quick alternative to Powerpoint if you need to present some code. Again, these are customizable: https://rmarkdown.rstudio.com/ioslides_presentation_format.html.

If you are interested in a separate tutorial on making and customizing ioslides or the fancier Slidify slides, please leave a comment in the Lesson 4 survey (https://www.surveymonkey.com/r/PVHDKDB).


A Real Messy Dataset

I looked for a messy dataset for data cleaning and found it in a blog titled:
“Biologists: this is why bioinformaticians hate you…”

The main and common issue with this dataset is that when data entry was done there was no structured vocabulary; people could type whatever they wanted into free text answer boxes instead of using dropdown menus with limited options, giving an error if something is formatted incorrectly, or stipulating some rules (ie. must be all lowercase, uppercase, no numbers, spacing, etc).

I must admit I have been guilty of messing with people who have made databases without rules. For example, giving an emergency contact, there was a line to input ‘Relationship’, which could easily have been a dropdown menu: ‘parent, partner, friend, other’. Instead I was allowed to write in a free text box ‘lifelong kindred spirit, soulmate and doggy-daddy’. I don’t think anyone here was trying to be a nuisance, this messy data is just a consequence of poor data collection.

Challenge:

This is the Wellcome Trust APC dataset, which documents the costs of open access publishing by providing article processing charge (APC) data.

https://figshare.com/articles/Wellcome_Trust_APC_spend_2012_13_data_file/963054

What I want to know is:

  1. List 3 problems with this dataset that require data cleaning.
  2. What is the mean cost of publishing for the top 3 most popular publishers?
  3. What is the number of publications by PLOS One in the dataset?
  4. Convert sterling to CAD. What is the median cost of publishing with Elsevier in CAD?
  5. Annotate your data cleaning efforts and answers to these questions in an .Rmd file. Knit your final answers to pdf.

The route I suggest taking in answering these questions is:

There is a README file to go with this spreadsheet if you have questions about the data fields.


The blogger’s opinion of cleaning this dataset:

‘I now have no hair left; I’ve torn it all out. My teeth are just stumps from excessive gnashing. My faith in humanity has been destroyed!’

Don’t get to this point. The dataset doesn’t need to be perfect. No datasets are 100% clean. Just do what you gotta do to answer these questions.

We can talk about how this went at the beginning of next week’s lesson.


Resources:
http://stat545.com/block022_regular-expression.html
http://stat545.com/block027_regular-expressions.html
http://stat545.com/block028_character-data.html
http://r4ds.had.co.nz/strings.html
http://www.gastonsanchez.com/Handling_and_Processing_Strings_in_R.pdf
http://varianceexplained.org/r/trump-tweets/
http://www.opiniomics.org/biologists-this-is-why-bioinformaticians-hate-you/
https://figshare.com/articles/Wellcome_Trust_APC_spend_2012_13_data_file/963054
http://www.datacommunitydc.org/blog/2013/08/fantastic-presentations-from-r-using-slidify-and-rcharts/
https://github.com/rdpeng/cachesweave/blob/master/inst/doc/cacheSweave.Rnw
http://emailregex.com/
https://regex101.com/
https://regexr.com/
https://www.regular-expressions.info/backref.html
https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf
https://raw.githubusercontent.com/today-is-a-good-day/Emoticons/master/emDict.csv
http://rmarkdown.rstudio.com
https://yihui.name/knitr/options/#chunk_options
https://www.cs.bham.ac.uk/~axj/pub/teaching/2016-7/stats/knitr-manual.pdf
https://yihui.name/knitr/demo/cache/
https://help.github.com/articles/organizing-information-with-tables/
https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html#getting_started
https://rmarkdown.rstudio.com/ioslides_presentation_format.html
https://rpubs.com/
https://www.jvcasillas.com/slidify_tutorial

Post-Lesson Assessment


Your feedback is essential to help the next cohort of trainees. Please take a minute to complete the following short survey: https://www.surveymonkey.com/r/PVHDKDB




Thanks for coming!!!

---
title: "Lesson 4 - Of Data Cleaning and Documentation - Conquer Regular Expressions, Use R markdown and knitr to make PDFs, and Challenge yourself with a 'Real' Dataset"
output: 
  html_document:
          keep_md: yes
          toc: TRUE
          toc_depth: 3
  html_notebook:
          toc: TRUE
          toc_depth: 3
---
***
![](img/big-data-borat.png){width=400px} 

</br>

##A quick intro to the intro to R Lesson Series

</br>

This 'Intro to R Lesson Series' is brought to you by the Centre for the Analysis of Genome Evolution & Function's (CAGEF) bioinformatics training initiative. This course was developed based on feedback on the needs and interests of the Department of Cell & Systems Biology and the Department of Ecology and Evolutionary Biology. 



This lesson is the fourth in a 6-part series. The idea is that at the end of the series, you will be able to import and manipulate your data, make exploratory plots, perform some basic statistical tests, test a regression model, and make some even prettier plots and documents to share your results. 


![](img/data-science-explore.png)

</br>

How do we get there? Today we are going to be learning data cleaning and string manipulation; this is really the battleground of coding - getting your data into the format where you can analyse it. We will also be learning r markdown so that we can easily annotate our code and share it with others in reproducible documents. In the next lesson we will learn how to do t-tests and perform regression and modeling in R. And lastly, we will learn to write some functions, which really can save you time and help scale up your analyses.


![](img/spotify-howtobuildmvp.gif)

</br>

The structure of the class is a code-along style. It is hands on. The lecture AND code we are going through are available on GitHub for download at https://github.com/eacton/CAGEF, so you can spend the time coding and not taking notes. As we go along, there will be some challenge questions and multiple choice questions on Socrative. At the end of the class if you could please fill out a post-lesson survey (https://www.surveymonkey.com/r/PVHDKDB), it will help me further develop this course and would be greatly appreciated. 

***

####Packages Used in This Lesson

The following packages are used in this lesson:

`tidyverse` (`ggplot2`, `tidyr`, `dplyr`)     
(`twitteR`)\*     
(`httr`)\*     
`tidytext`     
`viridis`     
`knitr`     
`kableExtra`     
`wordcloud`     

*Used to generate the tweet tables used in this lesson. It is not necessary for you to install this - you can work from the tables. If you want to create these files - the code is here  - [twitter scrape](https://github.com/eacton/CAGEF/blob/master/Lesson_4/twitter_scrape.R).    

Please install and load these packages for the lesson. In this document I will load each package separately, but I will not be reminding you to install the package. Remember: these packages may be from CRAN OR Bioconductor. 


***
####Highlighting

`grey background` - a package, function, code or command      
*italics* - an important term or concept     
**bold** - heading or 'grammar of graphics' term      
<span style="color:blue">blue text</span> - named or unnamed hyperlink     

***
__Objective:__ At the end of this session you will be able to use regular expressions to 'clean' your data. You will also learn R markdown and be able to render your R code into slides, a pdf, html, a word document, or a notebook.

***

####Load libraries

Since we are moving along in the world, we are now going to start loading our libraries at the start of our script. This is a 'best practice' and makes it much easier for someone to reproduce your work efficiently by knowing exactly what packages they need to run your code. We will learn how to do this with a function in Lesson 6!

```{r message = FALSE}
library("tidyverse")
library("tidytext")
library("viridis")
library("knitr")
library("kableExtra")
library("wordcloud")

```

***

##Data Cleaning or Data Munging or Data Wrangling

Why do we need to do this?

'Raw' data is seldom (never) in a useable format. Data in tutorials or demos has already been meticulously filtered, transformed and readied to showcase that specific analysis. How many people have done a tutorial only to find they can't get their own data in the format to use the tool they have just spent an hour learning about???

Data cleaning requires us to:

- get rid of inconsistencies in our data. 
- have labels that make sense. 
- check for invalid character/numeric values.
- check for incomplete data.
- remove data we do not need.
- get our data in a proper format to be analyzed by the tools we are using. 
- flag/remove data that does not make sense.

Some definitions might take this a bit farther and include normalizing data and removing outliers, but I consider data cleaning as getting data into a format where we can start actively doing 'the maths or the graphs' - whether it be statistical calculations, normalization or exploratory plots. 

Today we are going to mostly be focusing on the **data cleaning of text**. This step is crucial to taking control of your dataset and your metadata. I have included the functions I find most useful for these tasks but I encourage you to take a look at the [Strings Chapter](http://r4ds.had.co.nz/strings.html) in *R for Data Science* for an exhaustive list of functions. We have learned how to transform data into a tidy format in Lesson 2, but the prelude to transforming data is doing the grunt work of data cleaning. So let's get to it!

<div style="float:center;margin: 10px 0 10px 0" markdown="1">
![](img/cleaning.gif){width=300px}
</div>

</br>

</br>


##Intro to regular expressions


**Regular expressions**

"A God-awful and powerful language for expressing patterns to match in text or for search-and-replace. Frequently described as 'write only', because regular expressions are easier to write than to read/understand. And they are not particularly easy to write."  - Jenny Bryan

</br>

![](img/xkcd-1171-perl_problems.png)

</br>

So why do regular expressions or 'regex' get so much flak if it is so powerful for text matching?

Scary example: how to match an email address in different programming languages <http://emailregex.com/>. 

Regex is definitely one of those times when it is important to annotate your code. There are many jokes related to people coming back to their code the next day and having no idea what their code means.

<div style="left;margin:0 20px 20px 0" markdown="1">
![](img/yesterdays-regex.png){width=400px} 
</div>

There are sites available to help you make up your regular expressions and validate them against text. These are usually not R specific, but they will get you close and the expression will only need a slight modification for R (like an extra backslash - described below).

Regex testers:

<https://regex101.com/>     
<https://regexr.com/>

What I would like to get across is that it is okay to google and use resources early on for regex, and that even experts still use these resources.  


</br>

<div style="float:left;margin:0 10px 10px 0" markdown="1">
![](img/80170c11996bd58e422dbb6631b73c4b.jpg){width=350px} 
</div>

<div style="float:right;margin:0 10px 10px 0" markdown="1">
![](img/regexbytrialanderror-big-smaller.png){width=350px} 
</div>

</br>

</br>

</br>

</br>
</br>

__What does the language look like?__ 

The language is based on _meta-characters_ which have a special meaning rather than their literal meaning. For example, '$' is used to match the end of a string, and this use supersedes its use as a character in a string (ie. 'Joe paid \$2.99 for chips.'). 


###Matching by position

Where is the character in the string?

```{r echo = FALSE, eval = TRUE, warning = FALSE}

text_table <- data.frame(
  Expression = c("^", "$", "\\\\b", "\\\\B"),
  Meaning = c("start of string", "end of string", "empty string at either edge of a word", "empty string that is NOT at the edge of a word")
)

kable(text_table, "html") %>%
  kable_styling(full_width = F) %>%
  column_spec(1, border_right = T) %>%
  column_spec(2, italic = T, width = "40em")
```
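
A minimal sketch of these position anchors, using a few made-up strings and base R's `grepl`:

```r
x <- c("cat", "concat", "cat food", "scatter")

starts_cat <- grepl("^cat", x)       # "cat" at the start of the string
ends_cat   <- grepl("cat$", x)       # "cat" at the end of the string
whole_word <- grepl("\\bcat\\b", x)  # "cat" as a whole word
inside     <- grepl("\\Bcat", x)     # "cat" that does not start a word

data.frame(x, starts_cat, ends_cat, whole_word, inside)
```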



###Quantifiers

How many times will a character appear?

```{r echo = FALSE, eval = TRUE, warning = FALSE}
text_table <- data.frame(
  Expression = c("?", "\\*","\\+", "{n}", "{n,}", "{,n}", "{n,m}"),
  Meaning = c("0 or 1", "0 or more", "1 or more", "exactly n", "at least n", "at most n", "between n and m (inclusive)")
)

kable(text_table, "html") %>%
  kable_styling(full_width = F) %>%
  column_spec(1, border_right = T) %>%
  column_spec(2, italic = T, width = "40em")
```
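
To see the quantifiers in action, here is a small sketch on made-up spellings. Each quantifier applies to the character just before it (the "u").

```r
x <- c("clr", "color", "colour", "colouur")

zero_or_one  <- grepl("colou?r", x)      # 0 or 1 "u"
zero_or_more <- grepl("colou*r", x)      # 0 or more "u"
one_or_more  <- grepl("colou+r", x)      # 1 or more "u"
exactly_two  <- grepl("colou{2}r", x)    # exactly 2 "u"
one_to_two   <- grepl("colou{1,2}r", x)  # between 1 and 2 "u"

data.frame(x, zero_or_one, zero_or_more, one_or_more, exactly_two, one_to_two)
```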


###Classes

What kind of character is it?

```{r echo = FALSE, eval = TRUE, warning = FALSE}
text_table <- data.frame(
  # ranges are written as A-Za-z: the range A-z would also cover the symbols between "Z" and "a" in ASCII
  Expression = c("\\\\w, [A-Za-z0-9], [[:alnum:]]", "\\\\d, [0-9], [[:digit:]]", "[A-Za-z], [[:alpha:]]", "\\\\s, [[:space:]]", "[[:punct:]]", "[[:lower:]]", "[[:upper:]]", "\\\\W, [^A-Za-z0-9]", "\\\\S", "\\\\D, [^0-9]"),
  Meaning = c("word characters (letters + digits; \\\\w also matches the underscore)", "digits", "alphabetical characters", "space", "punctuation", "lowercase", "uppercase", "not word characters", "not space", "not digits")
)

kable(text_table, "html") %>%
  kable_styling(full_width = F) %>%
  column_spec(1, border_right = T) %>%
  column_spec(2, italic = T, width = "40em")
```
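
Here is a quick sketch of the classes, using `gsub` on a made-up tweet. The last two lines illustrate why a range written as `[A-z]` is risky: in ASCII order it also covers the symbols that sit between "Z" and "a", such as the underscore.

```r
x <- "Tweet #42: Falcon 9 launched at 10:30!"

no_digits   <- gsub("[[:digit:]]", "", x)   # drop digits
no_punct    <- gsub("[[:punct:]]", "", x)   # drop punctuation
underscored <- gsub("\\s", "_", x)          # replace each space
word_chars  <- gsub("\\W", "", x)           # keep only word characters
letters_only <- gsub("[^[:alpha:]]", "", x) # keep only letters

c(no_digits, no_punct, underscored, word_chars, letters_only)

# Compare the two ways of writing a letter range on an underscore
grepl("[A-z]", "_")
grepl("[A-Za-z]", "_")
```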


###Operators

Helper actions to match your characters.

```{r echo = FALSE, eval = TRUE, warning = FALSE}
text_table <- data.frame(
  Expression = c("|", ".", "[  ]", "[ - ]", "[^ ]", "( )"),
  Meaning = c("or", "matches any single character", "matches ANY of the characters inside the brackets", "matches a RANGE of characters inside the brackets", "matches any character EXCEPT those inside the bracket", "grouping - used for [backreferencing](https://www.regular-expressions.info/backref.html)")
)

kable(text_table, "html") %>%
  kable_styling(full_width = F) %>%
  column_spec(1, border_right = T) %>%
  column_spec(2, italic = T, width = "40em")
```
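
A small sketch of the operators on made-up strings. The last line shows grouping with backreferences: each pair of parentheses captures part of the match, and `\\1`, `\\2`, `\\3` reuse those parts in the replacement.

```r
x <- c("grey", "gray", "groy", "2018-03-14")

either     <- grepl("grey|gray", x)   # either alternative
any_char   <- grepl("gr.y", x)        # any single character in the third position
e_or_a     <- grepl("gr[ea]y", x)     # "e" or "a" only
not_e_or_a <- grepl("gr[^ea]y", x)    # anything except "e" or "a"
date_like  <- grepl("^[0-9-]+$", x)   # only digits (a range) and hyphens

data.frame(x, either, any_char, e_or_a, not_e_or_a, date_like)

# Reorder year-month-day into day/month/year using the captured groups
swapped <- sub("([0-9]{4})-([0-9]{2})-([0-9]{2})", "\\3/\\2/\\1", x[4])
swapped
```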

###Escape characters

Sometimes a meta-character is just a character. _Escaping_ allows you to use a character 'as is' rather than its special function. In R, a pattern is first parsed as a string and only then as a regular expression, and the backslash is also the escape character for strings - so you really need 2 backslashes to escape, say, a '$' sign (`"\\$"`). 

```{r echo = FALSE, eval = TRUE, warning = FALSE}
text_table <- data.frame(
  Expression = c("\\\\"),
  Meaning = c("escape for meta-characters to be used as characters (*, $, ^, ., ?, |, \\\\, [, ], {, }, (, )). 
              Note: the backslash is also a meta-character.")
)

kable(text_table, "html") %>%
  kable_styling(full_width = F) %>%
  column_spec(1, border_right = T) %>%
  column_spec(2, italic = T, width = "40em")
```
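
A minimal sketch of escaping, using the chips sentence from earlier and one made-up string without a dollar sign. The argument `fixed = TRUE` is an alternative that treats the whole pattern as literal text.

```r
x <- c("Joe paid $2.99 for chips.", "No price here")

anchor  <- grepl("$", x)                # unescaped: the end-of-string anchor
literal <- grepl("\\$", x)              # escaped: a literal dollar sign
fixed   <- grepl("$", x, fixed = TRUE)  # no regex at all, just the character

data.frame(x, anchor, literal, fixed)

# R stores the pattern "\\$" as 2 characters: one backslash and "$"
nchar("\\$")
```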

Trouble-shooting with escaping meta-characters means adding backslashes until something works. 

![Joking/Not Joking (xkcd)](img/backslashes.png)

While you can always refer back to this lesson for making your regular expressions, you can also use this [regex cheatsheet](https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf).

</br>


##Data Cleaning with Base R (AKA What is Elon Musk up to anyways?)

Let's take this cacophony of characters we've just learned about and perform some basic data cleaning tasks with an actual messy data set. I have scraped Elon Musk's latest tweets from Twitter. The code to do this is in the Lesson 4 file [twitter_scrape.R](https://github.com/eacton/CAGEF/blob/master/Lesson_4/twitter_scrape.R) if you are curious or want to creep someone on Twitter.

Let's read in the set of tweets and take a look at the structure of the data.

```{r }
elon_tweets_df <- read.delim("data/elon_tweets_df.txt", sep = "\t", stringsAsFactors = F)
```

The warning with EOF (end of file) within quoted string is possibly due to the fact that there are special characters (emojis, arrows, etc.) inside the cells. Let's take a look at how the file was parsed.

```{r}
str(elon_tweets_df)

```

Our end goal is going to be to look at the top 50 words in Elon Musk's tweets and make a wordcloud. I don't want urls, hashtags, or other tags. I also don't want punctuation or spaces. I just want to extract the words from tweets. It might be fun to look at the top favorite tweets while we are data cleaning, so let's use `tidyverse` functions to keep the text tweets and order them by the favorited counts.
```{r}
elon_tweets_df <- elon_tweets_df %>% 
  select(text, favoriteCount) %>%
  arrange(desc(favoriteCount))

elon_tweets_df$text[1:5]
```

First, I want to remove the tags from the beginning of words. I am going to save my regular expression into an object so that we can use it again later.

What this expression says is that I want to find matches for a hashtag OR an asperand ('at' symbol) followed by at least one word character. Because `|` has the lowest precedence, the pattern reads as `#` on its own OR `@\\w+`: only the hashtag symbol is matched, while a mention is matched together with its username. `grep` is a function that allows us to match our pattern (our expression) to a character vector. It is a good idea to do a visual inspection of your result to make sure your matches or substitutions are working the way you expected.

```{r}
tags <- "#|@\\w+"

grep(pattern = tags, x = elon_tweets_df$text)

```
We can see that `grep` returns the index of the match. We have a number of entries that include tags. We also have a number of warnings that we will return to. 

If we want to return the tweet itself instead of the index, we can use the argument `value = TRUE`.  In this case, it looks like each tweet matched does have a tag. (You will have a warning here too, I didn't print it here.)

```{r warning = FALSE}
grep(tags, elon_tweets_df$text, value = TRUE) %>% head()

```
We can then use `gsub` to replace that pattern (our tags) with nothing (an empty string).
```{r}
elon_tweets_df$text <- gsub(pattern = tags, replacement = "", elon_tweets_df$text)
```
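
To check what a pattern with `|` really matches, it can help to try it on a small made-up example first. The sketch below compares the pattern used above with a bracketed alternative, `[#@]\\w+`, which is not used in this lesson and is shown only for contrast.

```r
toy <- c("Launch day! #SpaceX @NASA", "no tags here")

# Which strings contain a tag at all?
has_tag <- grepl("#|@\\w+", toy)
has_tag

# "#" on its own OR "@" plus word characters: the hashtag word is kept
lesson_pattern <- gsub("#|@\\w+", "", toy[1])

# "#" or "@", each followed by word characters: both tags are removed whole
bracket_pattern <- gsub("[#@]\\w+", "", toy[1])

c(lesson_pattern, bracket_pattern)
```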

Back to the warnings about strings being 'invalid in this locale'. Let's take a look at these strings by subsetting for the indices given.

```{r}
elon_tweets_df$text[c(10,118, 156, 219, 224)]

```

From context, it looks like these character strings have emojis in them, which have their own character codes. Why would this give us a warning? Tweets are _encoded_ in UTF-16 and converted to UTF-8 when read into R. Things that have character codes get encoded differently. Here is an example of [emoji encoding](https://raw.githubusercontent.com/today-is-a-good-day/Emoticons/master/emDict.csv). Since we are going to remove anything with special character codes (ie. an apostrophe or emoji), we are going to use the `iconv` function to substitute encoded character codes that need converting with nothing (again, an empty character string). This is not something you will have to deal with on a daily basis, but character encoding is something to be aware of, especially when scraping data from the web.  

```{r}
elon_tweets_df$text <- iconv(elon_tweets_df$text, "UTF-8", "ASCII", sub = "")

elon_tweets_df$text[c(10,118, 156, 219, 224)]
```
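
As a small self-contained sketch of what this `iconv` call does, the made-up string below (an assumption for illustration, not taken from the tweets) mixes an accented letter and an emoji with plain ASCII text. Anything that cannot be represented in ASCII is replaced by the empty string given in `sub`.

```r
# A made-up string: an accented letter and an emoji (non-ASCII),
# plus a hashtag and a handle (both plain ASCII)
x <- "caf\u00e9 \U0001F680 #go @me"

ascii_only <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")
ascii_only

# Check that every remaining character has an ASCII code (below 128)
all(utf8ToInt(ascii_only) < 128)
```

The exact handling of non-convertible characters can differ slightly between platforms, so it is worth printing a few converted strings and looking at them.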

Looking back at our problematic strings, you can see that the emojis have been removed as well as quotation marks. Note that the hashtag and asperand are plain ASCII characters, so `iconv` would have left them in place; that is why we removed them separately with `gsub`.


Our next step would be to remove urls. This is a bit tricky. We could be looking for http:// or https:// followed by we don't know what (some combination of letters, numbers and forward slashes). 

We can check out which tweets have urls using `grep` as we did previously to see if we managed to match urls.

We are going to continue our pattern of using `gsub` to substitute what we don't want with an empty character string.

```{r}
url <- "http[s]?://[[:alnum:].\\/]+"

grep(url, elon_tweets_df$text, value = TRUE) %>% head()

# reuse the `url` pattern defined above rather than retyping it
elon_tweets_df$text <- gsub(pattern = url, replacement = "", elon_tweets_df$text)

```

We can also use `grepl` to get a logical response for whether a tweet has a url or not. That way, if you wanted to grab all of the urls that Elon Musk suggests to visit, you can filter with `grepl` to select all of the tweets where it is TRUE that a url is present. Note that this has to be run on the text before the urls are removed; after the `gsub` above there is nothing left to match.

```{r}
grepl(url, elon_tweets_df$text) %>% head()

elon_urls <- elon_tweets_df %>% filter(grepl(url, text))
```
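
The order of these steps matters, which a minimal base R sketch on made-up tweets can show. The data frame `toy` and the names `has_url` and `with_urls` are hypothetical; the `url` pattern is the one used in the lesson.

```r
# Made-up tweets for illustration only
toy <- data.frame(text = c("launch today https://t.co/abc123",
                           "no link here",
                           "read http://example.com/post.html now"),
                  stringsAsFactors = FALSE)

url <- "http[s]?://[[:alnum:].\\/]+"

has_url   <- grepl(url, toy$text)          # one TRUE/FALSE per tweet
with_urls <- toy[has_url, , drop = FALSE]  # keep only rows containing a url

toy$text <- gsub(url, "", toy$text)        # now strip the urls
grepl(url, toy$text)                       # nothing is left to match
```

Subsetting with the logical vector first and removing the urls second keeps a copy of the tweets that contained links.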


Lastly, we are going to get rid of trailing spaces, numbers, and punctuation all at the same time. You can find trailing spaces at the very end of our tweet string from removing the urls.

```{r}
# `+` (one or more digits) rather than `*`, so the pattern cannot match an empty string
trail <- "[ ]+$|[0-9]+|[[:punct:]]"

grep(trail, elon_tweets_df$text, value = TRUE) %>% head()

```
We can check to see that we are picking up strings with punctuation, numbers and trailing spaces, and then we can remove them and compare our output.

```{r}
elon_tweets_df$text <- gsub(pattern = trail, replacement = "", elon_tweets_df$text)

elon_tweets_df$text[1:5]
```
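
When checking a pattern with `grep` or `grepl`, keep in mind that the quantifier `*` means 'zero or more', so a pattern built on it can match the empty string and will then 'match' every input. A short sketch with made-up strings (the vector `x` is an assumption for illustration):

```r
x <- c("no digits here", "model 3")

grepl("[0-9]*", x)  # zero digits still counts as a match, so both are TRUE
grepl("[0-9]+", x)  # at least one digit is required, so only the second is TRUE
```

This is why a visual check of the matched strings is more informative when the pattern requires at least one character.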

It looks like everything worked except there are extra spaces from whenever a number was removed. Let's take all of the places where there are 2 or more spaces created and substitute them with just one space. 

```{r}
space <- "\\s{2,}"

grep(space, elon_tweets_df$text, value = TRUE) %>% head()

```
Again, we can check to see that we are picking up strings with extra spaces, and then replace those spaces with a single space.


```{r}
elon_tweets_df$text <- gsub(pattern = space, replacement = " ", elon_tweets_df$text)

elon_tweets_df$text[1:5]
```

It worked! 

***
__Challenge__ 


<div style="float:left;margin:0 10px 10px 0" markdown="1">
![](img/maxresdefault.jpg){width=200px}

</div>

We also have a leading whitespace where we removed a number. How would we remove that whitespace? Can you think of more than one way to do this?


</br>
</br>
</br>
</br>

***

```{r include = FALSE}
extra <- "^[ ]"

grep(extra, elon_tweets_df$text, value = TRUE) %>% head()

elon_tweets_df$text <- gsub(pattern = extra, replacement = "", elon_tweets_df$text)

elon_tweets_df$text[1:5]
```

Onwards!! Let's break the tweets down into individual words, so we can see what the most common words used are. We can use the base R function `strsplit` to do this; in this case we want to split our tweets into words using spaces. 


```{r}
strsplit(elon_tweets_df$text, split = " ") %>% head()

```
Note that the output of this function is a somewhat unwieldy list object: one character vector of words per tweet.

Luckily there is an `unlist` function which recursively will go through lists to simplify their elements into a vector. Let's try it and check the structure of our output. We will save this to an object called 'words'.

```{r}
unlist(strsplit(elon_tweets_df$text, split = " ")) %>% head(20)

words <- unlist(strsplit(elon_tweets_df$text, split = " "))

```

Our output is now a long character vector. This will make it much easier to count words. 

```{r}
str(words)
```
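
The difference between the list returned by `strsplit` and the flattened vector is easier to see on a tiny example. The strings in `toy` are made up for illustration.

```r
toy <- c("the boring company", "falcon heavy")

pieces <- strsplit(toy, split = " ")
str(pieces)            # a list of 2 character vectors, one per input string

flat <- unlist(pieces)
flat                   # a single character vector of 5 words
```

Note that `unlist` drops the information about which tweet each word came from, which is fine for overall word counts.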

Let's take a peek at the words.
```{r}
tail(words)
```



Great! But... we missed some `\n` (newline) and `\t` (tab) characters. These are not punctuation characters.



***
__Challenge__ 


<div style="float:left;margin:0 10px 10px 0" markdown="1">
![](img/maxresdefault.jpg){width=200px}

</div>

Newline and tab characters are separating 2 words. Split these words apart and get rid of the newline character. Convert all of our character strings to lowercase (I haven't shown you how to do this, but I believe in your google-fu). Check the first and last 50 words to see if anything else is amiss.


</br>
</br>
</br>

***

```{r include = FALSE}
words <- tolower(unlist(strsplit(words, "\\n|\\t")))
#equivalent to
words <- casefold(unlist(strsplit(words, "\\n|\\t")), upper = FALSE)

words[1:50]
tail(words, 50)
```

There are still a few problems with words cut off like 'solv', or 'flamethrower' and 'flamethrowers' being counted separately although they are the same word, or 'north' and 'korea' belonging together for context. If we were serious about this dataset we would need to resolve these issues. We also have some html and twitter-specific tags that we will deal with shortly. 

Let's move ahead and count the number of occurrences of each word and order them by frequency. We do this using our `dplyr` functions (Lesson 2).

```{r}
data.frame(words) %>% count(factor(words)) %>% arrange(desc(n))
```
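
For comparison, the same kind of tally can be sketched in base R with `table` and `sort`. The vector `toy_words` is made up for illustration; this is an alternative sketch, not the approach used in the rest of the lesson.

```r
toy_words <- c("tesla", "the", "rocket", "the", "tesla", "the")

counts <- sort(table(toy_words), decreasing = TRUE)
counts

# The same tally as a data frame, closer to what `count` returns
as.data.frame(counts, stringsAsFactors = FALSE)
```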


Wow. We have discovered people use prepositions and conjunctions. There are also words unrelated to content but that are html jargon, or things like 'na' and 'false'. 

Luckily text mining is an area of data analytics in full force and there is a list of 'stop words' that can be used to get rid of words that are unlikely to contain useful information as part of the `tidytext` package. However, we will have to add to this list.

The data that comes with the package is called `stop_words`. We can save it as an object and take a look at its structure.
```{r}
stop_words <- stop_words
str(stop_words)
```

We can then add rows to this data frame with our own stop words. Remember that to `bind_rows` data frames together, the column names have to match. We can make a small data frame and call our lexicon 'custom'. Note that I have written 'custom' once - it will be recycled from a character vector of length 1 to the length of the data frame.

```{r}
add_stop <- data.frame(word = c("na", "false", "href", "rel", "nofollow", "true", "amp", "twitter", "iphonea", "relnofollowtwitter", "relnofollowinstagrama"), 
                       lexicon = "custom", stringsAsFactors = FALSE)

stop_words <- bind_rows(stop_words, add_stop)
```


To remove these stop words from our list of words from tweets, we perform an anti-join (from Lesson 3).

```{r warning = FALSE}
words <- anti_join(data.frame(words), stop_words, by=c("words" = "word"))

```
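
An anti-join keeps the rows of the first table that have no match in the second. As a minimal sketch of that logic, here is a base R equivalent on made-up data (the names `toy`, `toy_stop` and `kept` are hypothetical):

```r
toy <- data.frame(words = c("the", "rocket", "and", "mars"),
                  stringsAsFactors = FALSE)
toy_stop <- data.frame(word = c("the", "and"),
                       lexicon = "custom",
                       stringsAsFactors = FALSE)

# Keep the rows of `toy` whose word does not appear in `toy_stop`
kept <- toy[!toy$words %in% toy_stop$word, , drop = FALSE]
kept
```

`anti_join` does the same job while letting the matching columns have different names, which is what the `by = c("words" = "word")` argument is for.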

Let's look at our top words by count now, and save this order.

```{r}
words %>% count(words) %>% arrange(desc(n))

words <- words %>% count(words) %>% arrange(desc(n))
```

The top words include 'boring', 'falcon', 'tesla', 'rocket', 'launch', 'flamethrower', 'cars', 'spacex' and 'tunnels'; 'mars' and 'ai' are a bit further down the list. There are a few words that look like they should be added to the 'stop words' list (dont, doesnt, didnt, im), but we'll work with this for now.

We can make a word cloud out of the top 50 words, which will be sized according to their frequency. I am starting with the first word after Elon Musk's twitter handle. The default color is black, but we can use our `viridis` package (Lesson 3) to have a pleasing color palette. It is okay if this code gives you a warning that not all words can be fit on the page; this can be changed by adjusting the `scale` argument.

```{r warning = FALSE}
words[2:51,] %>%
    with(wordcloud(words, n, ordered.colors = TRUE, colors = viridis(50), use.r.layout = TRUE))
```

***

##Data Cleaning with stringr/stringi (AKA What is Trump up to anyways?)

We are going to go through the same data cleaning process with the `stringr` package using Trump's tweets. The syntax is a little different, but it is pretty intuitive once you get started. All `stringr` functions can be found using `str_` + `Tab`. Again, we will start by loading the dataset and looking at the top 5 favorite tweets. We will remove all encoded character codes right away.

```{r}
trump_tweets_df <- read.delim("data/trump_tweets_df.txt", sep = "\t", stringsAsFactors = FALSE)
trump_tweets_df$text <- iconv(trump_tweets_df$text, "UTF-8", "ASCII", sub = "")

trump_tweets_df <- trump_tweets_df %>% select(text, favoriteCount) %>% arrange(desc(favoriteCount)) 
trump_tweets_df$text[1:5]
```

The first thing that we did was look for tags. The order of arguments is switched in `stringr` relative to the base functions. The first argument will be the character string we are searching, and the second argument will be the pattern we are matching. `str_extract` returns a character vector the same length as its input, holding the first match from each string, or `NA` where there was no match. This is similar to `grep` when `value = TRUE`, except that only the match is extracted rather than the entire string, and non-matching strings are kept as `NA` instead of being dropped.

```{r}
str_extract(string = trump_tweets_df$text, pattern = tags) %>% head(100)

```

`str_detect` is similar to `grepl` returning TRUE or FALSE if a match is or isn't found, respectively.

```{r}
str_detect(trump_tweets_df$text, tags)

```
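
A minimal sketch of how these `stringr` functions differ, using made-up strings (the vector `x` is an assumption for illustration; `tags` is the pattern from earlier in the lesson):

```r
library(stringr)

x <- c("thanks @nasa", "no tags here", "#jobs and @whitehouse")
tags <- "#|@\\w+"

str_extract(x, tags)      # first match per string, NA where there is none
str_extract_all(x, tags)  # a list holding every match per string
str_detect(x, tags)       # TRUE/FALSE per string
```

Because the output of `str_extract` lines up with the input, it can be stored as a new column of the same data frame.
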
Let's remove our urls as before. With the `str_replace_all` function we can specify our pattern and replacement, in this case an empty character string. We can see in the result that the urls have been removed.

```{r}
str_replace_all(trump_tweets_df$text[1:10], pattern = url, replacement = "")
trump_tweets_df$text <- str_replace_all(trump_tweets_df$text, pattern = url, replacement = "")
```

Let's be ambitious and try to remove tags, numbers and punctuation characters all in one go. `str_remove_all` automatically replaces every match with an empty character string. It turns out that `@` and `#` are punctuation characters, so removing them is taken care of using `[[:punct:]]`. We also want to remove the metacharacter `$` (which `stringr` does not consider punctuation). We aren't sure what order the numbers and punctuation might come in, and square brackets allow ANY of the characters inside the brackets to be matched. We are not sure if there will be zero, one, or many of our target characters in a tweet; however, `str_remove_all()` will remove every instance of this pattern (otherwise we would use the `*` outside the brackets to indicate 0 or more times). Looking at the output, we can see that the numbers and punctuation and dollar signs are indeed removed.

```{r}
clean_all <- "[[0-9][[:punct:]]\\$]"

trump_tweets_df$text <- str_remove_all(trump_tweets_df$text, pattern = clean_all)

trump_tweets_df$text[1:10]

```
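
The reason `$` needs its own entry in the pattern can be sketched by comparing `stringr` with base R on a made-up string (the string `x` is an assumption for illustration). The two use different regular expression engines, and as far as I know they do not agree on which characters count as `[[:punct:]]`.

```r
library(stringr)

x <- "Save $5 today! #deal @shop"

str_remove_all(x, "[[:punct:]]")            # stringr: '$' is left in place
gsub("[[:punct:]]", "", x)                  # base R: '$' is treated as punctuation
str_remove_all(x, "[[0-9][[:punct:]]\\$]")  # the lesson's pattern: digits and '$' go too
```

Symbols such as `<`, `>` and `=` behave like `$` in `stringr`, so it is worth checking the output for leftovers.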

As expected, we still have trailing spaces. Whitespace characters are not visible, but take up space. Newline characters, tabs and spaces are a form of whitespace. `stringr` has its own function for trimming whitespace, `str_trim`, which you can use to specify whether you want leading or trailing whitespace trimmed, or both.

```{r}
trump_tweets_df$text <- str_trim(trump_tweets_df$text, side = "both")

trump_tweets_df$text[1:10]
```

See how we have a couple extra spaces in the middle of some of our strings? `str_squish` will take care of that for us, leaving only a single space between words.

```{r}
trump_tweets_df$text <- str_squish(trump_tweets_df$text)

trump_tweets_df$text[1:10]
```
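
The difference between trimming and squishing is easiest to see side by side. A small sketch on a made-up string (the string `x` is an assumption for illustration):

```r
library(stringr)

x <- "  hello   world  "

str_trim(x, side = "both")  # leading and trailing whitespace removed, inner run kept
str_squish(x)               # also collapses the inner run to a single space
trimws(x)                   # base R counterpart of str_trim
```

Since `str_squish` also trims both ends, calling it on its own would give the same final result as `str_trim` followed by `str_squish`.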

All that's left is to convert all characters to lowercase, and then we can see the top Trump words!

```{r}
trump_tweets_df$text <- tolower(trump_tweets_df$text)

trump_tweets_df$text[1:10]
```

To get our tweets into a word list we use `str_split`, a similar function to `strsplit`, still splitting by the spaces between words. The argument `simplify = FALSE` returns a list of character vectors which we then unlist.


```{r}
words <- unlist(str_split(trump_tweets_df$text, pattern = " ", simplify = FALSE))
str(words)
```

We can now do our `anti_join` to remove 'stop words', and tally our remaining words and order them by descending counts.


```{r warning = FALSE}
words <- anti_join(data.frame(words), stop_words, by=c("words" = "word"))

words %>% count(words) %>% arrange(desc(n)) 

words <- words %>% count(words) %>% arrange(desc(n))

```

Hmmm... it looks like we have those html tags in a different format. It's interesting to note these little variations because no matter how much you try to automate your analysis there is always going to be something from your new dataset that didn't fit with your old dataset. This is why we need these data wrangling skills. Even though some packages may have been created to help us on our way, they can't possibly cover every case. 

<div style="float:left;margin:0 10px 10px 0" markdown="1">
![](img/1467481_240434926124232_550310772_n.jpg){width=500px}


</div>

</br>
</br>     
</br>
</br>
</br>
</br>     
</br>
</br>
</br>

</br>
</br>
</br>
</br>     
</br>
</br>



We could go back and get rid of some of the characters such as `<`, however we don't want to lose sight that these are html tags and not words (the tweet was from an ipad or iphone, the 'word' isn't being mentioned). We will instead add these to our stop words list.

```{r}
add_stop <- data.frame(word = c("rel=nofollow>twitter", "href=", "iphone<a>", "<a","dont", "$", "href=downloadipad", "ipad<a>" ), 
                       lexicon = "custom", stringsAsFactors = FALSE)


stop_words <- bind_rows(stop_words, add_stop)

```

We then perform an `anti_join` with our new list and view the updated version. (`words` was already sorted and so we do not need to do that again.)

```{r warning = FALSE}
words <- anti_join(data.frame(words), stop_words, by=c("words" = "word"))
words[1:50,]
```
'president', 'people', 'fake', 'news', 'daca', 'democrats', 'jobs', 'obama', 'border', 'fbi', 'collusion', 'russia', 'wall', 'mexico' and further down are 'crooked' and 'hillary'. 

Trump's wordcloud minus his twitter handle.
```{r}
words[2:51,] %>%
    with(wordcloud(words, n, scale = c(3, .5), ordered.colors = TRUE, colors = viridis(50), use.r.layout = TRUE))
```

***
__Challenge__ 

<div style="float:left;margin:0 10px 10px 0" markdown="1">
![](img/maxresdefault.jpg){width=200px}

</div>

Pick one of the other tweet data sets: 

  [Bill Nye](https://github.com/eacton/CAGEF/blob/master/Lesson_4/data/nye_tweets_df.txt), [Justin Trudeau](https://github.com/eacton/CAGEF/blob/master/Lesson_4/data/jt_tweets_df.txt), [The Daily Show](https://github.com/eacton/CAGEF/blob/master/Lesson_4/data/daily_tweets_df.txt), [Katy Perry](https://github.com/eacton/CAGEF/blob/master/Lesson_4/data/katy_tweets_df.txt), [Jimmy Fallon](https://github.com/eacton/CAGEF/blob/master/Lesson_4/data/jimmy_tweets_df.txt), [Stephen Colbert](https://github.com/eacton/CAGEF/blob/master/Lesson_4/data/colbert_tweets_df.txt).      


Clean it. Remove all of the stop words. Were there any other challenges compared to the previous datasets? Did you have to create new stop words or do extra regex? Make a wordcloud of the top 50 words.


</br>

***


##Rmarkdown and knitr

Markdown is a plain text formatting syntax. It allows one to easily add headings, lists, links, highlighting, bullets, images, equations, tables and text styling. R has a modified version of markdown (R markdown) where you can embed code _chunks_ into a document. Combined with the `knitr` package, this allows us to make reproducible documents. The awesomeness of this combination allows us to annotate our code while we work in a format that is immediately presentable. With the click of a button our script (or notebook) can be converted into a word document, a pdf, or a shareable html hyperlink. Other formats such as slides are also possible. This means that you only have to do your work once - you don't have to have your code, generate images, paste them into powerpoint or word - get asked to change something, rerun the code, get the figure, change your powerpoint... everything is all in one place - you make your change, knit your document and you are done - you don't have to leave RStudio.

Markdown and `knitr` can also save us time scrolling through our code - a table of contents can be added, code chunks can be named - both of which allow us to jump around and navigate our script easily. It takes a bit of discipline to get started, but I believe that you will see the benefits pretty quickly. There are even markdown templates for submitting to different journals or writing a thesis.

For this lesson I suggest going to `Tools -> Global Options -> R Markdown -> Show output preview in`: and change from 'Window' to 'Viewer Pane', then click 'Apply' and 'OK'. This will allow us to see our new document in the same window (in the Viewer) instead of switching back and forth between our code and the document in a separate window. 

###R markdown syntax

Let's start by creating a new R markdown document by going to `File -> New File -> R Markdown`. R will ask you for the Title of your document, the Author and whether you want to _render_ your markdown as html, pdf, or as a word document. There is also the option to make slides (under Presentation), a Shiny app, or use Templates for package documentation or GitHub (there is a git version of markdown that differs slightly from R markdown).  html renders faster than the other formats, so we will stick with that and click OK. R immediately puts a yaml (yet another markup language, yaml ain't markup language) header which tells you how the file is configured. 

    ---
    title: "R markdown Lesson"
    author: "EA"
    date: "April 11, 2018"
    output: html_document
    ---

As you can see, the date has been added (this is the date of script creation and does not update when the script is rendered), and the output is going to be html. Note also that your document is Untitled. We can go ahead and save that as 'Rmarkdown_Lesson' and see that the file is saved as a `.Rmd` file.

We can see that R has a little demo set up already which we are going to work with. The new elements in .Rmd files are these code _chunks_, denoted by a set of 3 backticks followed by `{r}`, some code, and a closing set of 3 backticks. The keyboard shortcut for generating a code chunk is `CTRL + ALT + I`.

\`\`\`\{ r name_of_chunk,    code_options\}     

type code here

\`\`\`

These code chunks have been named 'setup', 'cars', and 'pressure', and at the bottom left of the source pane (or if you go to `Code -> Jump To...` it will pop up) you can navigate between these code chunks. This is a helpful feature as your code gets a bit longer. 

You have probably also noticed that in this navigation bar the bolded items correspond to the title of the document and the text with a leading `##`. Hashtags (when outside code chunks, ie. in markdown language) denote headers. The number of hashtags denotes the level of the header as well as the size. For example, the title is a first level header, and 'R Markdown' and 'Including Plots' are second level headers, and will also be smaller than the title. Let's go ahead and _knit_ our document by clicking the Knit button. Note that you can change from your default output choice (html) to Word or pdf in the dropdown menu.

Let's look at how markdown is rendered in our html file. We can see that to **bold** text you can put two asterisks (`**`) or two underscores (`__`) on either side of the text. 

You can insert a url by typing the url inside of angle brackets <<http://rmarkdown.rstudio.com>>. If you want the link to be named 'rmarkdown' you can format it like this [rmarkdown] (http://rmarkdown.rstudio.com) without the space in between the name and the url. Replace the url with the named version and click knit to see the difference in the output.

The other emphasized text in this document has a grey background. This is achieved by flanking the text with \``backticks`\`. The code in the 'cars' chunk has a grey background and the evaluated output is in white. This is standard in the R community for the code to be grey and the output to be white and commented. Inline R code is written as a backtick, the letter r, a space, the expression, and a closing backtick ('r 2+2' flanked by backticks); it is evaluated when the document is knit and its result is inserted into the text. 


*Rmarkdown*
```{r results = 'asis', eval = FALSE}
To make a bulleted list:

* you need to leave a line before 
* the text and the start of your bullets and a 
* space between the asterisk (bullet) and your text.
```


*Rendered*

To make a bulleted list:

* you need to leave a line before 
* the text and the start of your bullets and a 
* space between the asterisk (bullet) and your text.


*Rmarkdown + Rendered*

To make a numbered list:

1. you need to leave a line before 
2. the text and the start of your numbers and a 
3. space between the numbers and your text.



*Rmarkdown*

```{r results = 'asis', eval=FALSE}
To make a super-cool updatable numbered list:

1. you need to do the above as with numbered lists
1. but all of the numbers are numbered '1.' 
1. you can now add, remove and reorder and your numbers will update.


```



*Rendered*

To make a super-cool updatable numbered list:

1. you need to do the above as with numbered lists
1. but all of the numbers are numbered '1.' 
1. you can now add, remove and reorder and your numbers will update.


*Rmarkdown*

```{r results = 'asis', eval=FALSE}
If ever your text
is clumping together
when you do not expect it to,
remember that you need at least 2 spaces at the end of a line
to start a new line.
```

*Rendered*

If ever your text
is clumping together
when you do not expect it to,
remember that you need at least 2 spaces at the end of a line     
to start a new line.

*Rmarkdown*

```{r results = 'asis', eval=FALSE}
    A text box can be created by indenting with Tab twice.
```

*Rendered*

    A text box can be created by indenting with Tab twice.

*Rmarkdown*

A line across the page is '***'.

*Rendered*

***

###Knitr Chunk Options

You might notice that while there are 3 code chunks in this example, there is one line of code visible in the rendered version, and 2 outputs (summary statistics and a plot). Why don't we see the code used to make the plot? Code chunks have options that can be entered to modify their output. In this case the inclusion of `echo = FALSE` prevents the code from being included, but the code is still run and so the plot is still produced. Try changing the code chunk option for the plot to `eval = FALSE`. What happened?

`{r pressure, eval = FALSE}`
```{r pressure, eval = FALSE}
plot(pressure)
```

`eval = FALSE` means the code is shown, but not evaluated. 

You can specify which lines of code in a chunk get evaluated. For example if you had 5 lines of code, but only wanted to run the first and third, you could use `eval = c(1,3)`. This feature of using a vector of positions to specify lines of code is available for other chunk options such as `echo`. 

The first code chunk in this script is setting default options for all code chunks to be used in this script. In this case `echo = TRUE` was set as a default chunk option. The option `include = FALSE` for this chunk means that neither the code nor its output will be included in the document, but the code will still be run. This differs from `echo = FALSE`, which hides the code but still shows its output. 

Let's look at what the default chunk options are - this way we will be able to see all of the options available to change.

`{r setup, echo = TRUE }`
```{r setup, echo = TRUE }
str(knitr::opts_chunk$get())
```
There are a ton of options here. You can guess that some of them have to do with default figure sizes and labels, and there are a bunch of options that are not specified (NULL). 
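
As a short sketch of how defaults like these are set and queried, the code below uses `opts_chunk$set` (the same function the template's setup chunk calls) and then looks up single options by name. Adding `message = FALSE` as a document-wide default is only an illustration, not something the template does.

```r
library(knitr)

# Set defaults that apply to every chunk that follows
opts_chunk$set(echo = TRUE, message = FALSE)

# Look up individual defaults rather than printing the whole list
opts_chunk$get("echo")
opts_chunk$get("message")
```

Options given in an individual chunk header still override these defaults for that chunk.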

As far as setting chunk options at the beginning of a script goes, consider the following:

`{r}`
```{r}
library(tidyverse)

```

If we are creating a document, we may want to show what package we used, but we don't want all of the package startup messages. With `message = FALSE` the code will run and be shown but any messages generated will be suppressed.

`{r message = FALSE}`
```{r message = FALSE}
library(tidyverse)

```

I could use `include = FALSE, message = FALSE` if I wanted the library loaded but didn't want the code or its message to be seen.


If I wanted to do something silly like add 6 to every summary value (in truth each of these summary values is a character) it would generate an error. A document will not be rendered if it has an error in it. Try to knit the document with this code.

`{r error = TRUE}`
```{r error = TRUE}
summary(cars) + 6

```
Adding the option `error = TRUE` allows the document to be rendered despite the error. The error message will still be shown.

***
__Challenge__ 


<div style="float:left;margin:0 10px 10px 0" markdown="1">
![](img/maxresdefault.jpg){width=200px}

</div>


For the code chunk containing `plot(pressure)`: How would you show just the code (not the plot) and not run the code?  How would you show just the code and have the output run but not show it? How would you change the background color to something other than gray? You can use help pages, Google, or the knitr documentation found here: <https://yihui.name/knitr/options/#chunk_options>

</br>
</br>

***

###Caching


In the 'Run' dropdown menu (found in the top right of the source pane), there are various options for running your current chunk - `CTRL+SHIFT+ENTER`, the next chunk - `CTRL+ALT+N`, all chunks above - `CTRL+ALT+P`, all chunks below, and another few options. This allows you to assess the upstream and downstream consequences of a change in your code. `knitr` also has the option to _cache_ the output of code chunks by setting the option `cache = TRUE`. A folder will be created that saves the output of your chunk in a data file. On later knits, `knitr` accesses the cache and loads the result from the last time the chunk was run without recalculating values. This can be very useful if the code in a particular chunk takes a while to run and you are assessing changes unrelated to that code, or changes after that code. 

For example, if your document isn't knitting because of an error at line 200 and your time intensive code runs at line 100, you can cache the line 100 chunk and troubleshoot the line 200 code without having to wait for this earlier chunk to run again. The caching caveat is that a cached chunk is not rerun when code it depends on changes: if you change something earlier in the script (at line 50 in this example), the chunk at line 100 will still load its old cached result. Therefore it is important to be conscious of what you are caching and where changes are occurring in your script. You should uncache your code chunk for the final rendering to make sure there haven't been any unforeseen changes to your document. 

This is a simplified explanation of caching and more details can be found in the [knitr manual](https://www.cs.bham.ac.uk/~axj/pub/teaching/2016-7/stats/knitr-manual.pdf) and its [cache demo](https://yihui.name/knitr/demo/cache/).

__Playing with Caching__


__Scenario 1__

With our current .Rmd file, let's say the 'summary' chunk took a while to run. Let's add `cache = TRUE` to its chunk options. Make sure `eval = FALSE` has been removed from the options in your 'plot' chunk. Knit the document and take a look. 

We are now going to change the plot chunk to depend on the cars dataset.


`{r cars}`
```{r eval = FALSE}
plot(cars$speed, cars$dist)
```

We can knit the document again, and assume that we saved ourselves the time cost of running the second chunk when we are just updating a plot. 

Now let's put a chunk _before_ our cached chunk called 'new row'.  Add a point to the cars dataset using `dplyr`'s `bind_rows`. Note that I haven't loaded the entire `dplyr` package, but rather have just made a call to one specific function. Knit the document again.  

`{r new row}`
```{r eval = FALSE}
cars <- dplyr::bind_rows(cars, c(speed = 50, dist = 200))
```

This point is an outlier, and the change can be seen on the output of our plot. However, our summary also depends on the cars dataset and has not been updated (i.e. the maximum distance is still 120 ft; `dist` in the cars dataset is recorded in feet and `speed` in mph). If the code of the cached chunk does not change, the chunk is not rerun.

__Scenario 2__

If I change the cached chunk, say, by adding another outlier data point - what do you think will happen? Add this point and knit again.

`{r cars, message = FALSE, cache = TRUE}`
```{r message = FALSE, cache = TRUE, eval = FALSE}
cars <- dplyr::bind_rows(cars, c(speed = 100, dist = 300))
summary(cars)
```

Since the code in the cached chunk changed, its values were recalculated and the max distance in the summary has changed to 300 ft. Changes in the cached chunk are evaluated and passed on to the next code block. The previous cache values are deleted and replaced with the current values. The plot, which depends on the cars dataset, now has a point at dist = 300 ft AND dist = 200 ft.


__Scenario 3__


If I change cars (with the outlier point) in the 'new row' chunk back to cars (without the outlier point) and knit the document, what do you expect to happen?

`{r new row}`
```{r new row}
cars <- cars
```

Since the code for our cached chunk doesn't change, its values are loaded from the cache. This includes the cars dataset, and so the plot, downstream of our cache, does not reflect the change to the 'new row' chunk.


__Scenario 4__

What if we instead comment out the outlier point in the cached chunk and keep the extra point from the first chunk? 

`{r new row}`
```{r }
cars <- cars
cars <- dplyr::bind_rows(cars, c(speed = 50, dist = 200))
```


`{r cars, message = FALSE, cache = TRUE}`
```{r message = FALSE, cache = TRUE, eval = FALSE}
#cars <- dplyr::bind_rows(cars, c(speed = 100, dist = 300))
summary(cars)
```

The code in the cached chunk changed and so the summary was recalculated using the earlier chunk. Both the summary and the plot show the 200 ft distance value.

__Scenario 5__

What do you expect to happen if both outlier points are commented out and we restart the R session and knit the document? What is the max dist value in the summary table? What about the plot - does it contain the outlier?

`{r new row}`
```{r }
cars <- cars
#cars <- dplyr::bind_rows(cars, c(speed = 50, dist = 200))
```

`{r cars, message = FALSE, cache = TRUE}`
```{r message = FALSE, cache = TRUE}
#cars <- dplyr::bind_rows(cars, c(speed = 100, dist = 300))
summary(cars)
```

Since the code in the cached chunk did not change, it accesses the summary from the cached data, which has 200 ft as the maximum value. However, the plot needs to be created from scratch and uses cars from the first chunk, so it has a max of 120 ft.

__Scenario 6__

What if we remove `cache=TRUE` from our code chunk?

`{r cars, message = FALSE}`
```{r message = FALSE, eval = FALSE}
#cars <- dplyr::bind_rows(cars, c(speed = 100, dist = 300))
summary(cars)
```

The cache is no longer accessed. The summary table and the plot reflect the original cars dataset with a max of 120 ft.

__Scenario 7__

What if we add `cache=TRUE` to our code chunk again?

`{r cars, message = FALSE, cache = TRUE}`
```{r message = FALSE, cache = TRUE, eval = FALSE}
#cars <- dplyr::bind_rows(cars, c(speed = 100, dist = 300))
summary(cars)
```

The cache files are kept on the computer. The summary table reflects the previously cached value of 200 ft; the plot is calculated from the first chunk and has a max of 120 ft. The cache will keep this value until it is updated. To start with a fresh cache in the same directory you need to delete your cached files. 
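
Deleting the cache can be done from the file browser or from R. The sketch below assumes the document was saved as 'Rmarkdown_Lesson.Rmd' and that the cache folder sits next to it under the name 'Rmarkdown_Lesson_cache'; that folder name is an assumption, so check what the folder is called on your computer before deleting anything.

```r
# Assumed folder name: check your own project directory first
cache_dir <- "Rmarkdown_Lesson_cache"

if (dir.exists(cache_dir)) {
  unlink(cache_dir, recursive = TRUE)  # delete the folder and its contents
}

dir.exists(cache_dir)  # FALSE once the cache is gone
```

The next knit will then rebuild every cached chunk from scratch.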

###Tables

You can make tables in markdown, but it is kind of annoying compared to using the `kable` function from the `knitr` package. Making tables in markdown involves using a series of pipes (|) to make columns and hyphens (-) to separate the column headers from the rows. A guide to formatting markdown tables can be found here: <https://help.github.com/articles/organizing-information-with-tables/>. Go nuts. Here is an example of a markdown table, first as typed and then as rendered. 

      
      |Summary      | Values|
      |-------------|-------|
      | correlation |   `r cor(cars$speed, cars$dist)`|
      | mean km/h   |    `r mean(cars$speed)`|
      | mean km     |    `r mean(cars$dist)`|
     



|Summary      | Values|     
|-------------|--------|     
| correlation |   `r cor(cars$speed, cars$dist)`|     
| mean km/h   |    `r mean(cars$speed)`|     
| mean km     |    `r mean(cars$dist)`|     
      
####Rounding Values

The value for 'correlation' in this table was really the output of 'r cor(cars\$speed, cars\$dist)', which gives the value of `r cor(cars$speed, cars$dist)`. Rounding values is as easy as selecting the number of decimal places using the `round` function: 'r round(cor(cars\$speed, cars\$dist), 2)' would then give `r round(cor(cars$speed, cars$dist), 2)`, rounded to two decimal places. The `kable` tables we are going to work with have a `digits` argument which gets passed to the `round` function (usage: `digits = 2`).
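
A quick sketch of how `round` behaves, next to base R's `signif` for comparison. `round` counts digits after the decimal point, while `signif` counts significant digits; for a correlation between -1 and 1 the two often agree, so the extra numbers below are made up purely to show where they differ.

```r
r <- cor(cars$speed, cars$dist)

round(r, 2)             # 0.81 (two decimal places)
signif(r, 2)            # 0.81 (two significant digits)

# The two functions diverge for larger or smaller numbers
round(123.456, 2)       # 123.46
signif(123.456, 2)      # 120
round(0.000123456, 2)   # 0
signif(0.000123456, 2)  # 0.00012
```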

####Kable tables in knitr

In this lesson we are going to focus on nice looking `kable` tables, which are easily customizable through the [kableExtra](https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html#getting_started) package. The summary stats from cars is a table. However, it doesn't look very good. Here is a reminder of the default output. 

```{r}
summary(cars)
```

This summary is an odd table object. If we turn it into a data frame it will be easier to work with.

```{r}
dat <- data.frame(speed = summary(cars)[,1], distance = summary(cars)[,2])

```
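
As an aside, `summary()` on a data frame returns formatted text, so the columns of `dat` hold character strings rather than numbers. If numeric values are wanted (for example, to round them later), one possible alternative is to summarise each column separately. This is only a sketch; `num_summary` and `dat_num` are names made up for the illustration, and the lesson continues with `dat`.

```r
# Numeric summary statistics, one column per variable of the built-in cars data
num_summary <- sapply(cars, summary)

# Convert the matrix to a data frame; the row names hold the statistic labels
dat_num <- as.data.frame(num_summary)
dat_num
```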

Here, a simple call to `kable` creates a table styled similarly to the markdown table above.

```{r}
kable(dat)
```

A variety of styles are offered with simple syntax. Here we have striped rows which highlight when you hover over them, the table width is the length of the longest text rather than the width of the whole page, and the table is left-aligned.

```{r}
kable(dat, "html")  %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE, position = "left")
```


You can also have the table move to the left or right side of your document so that text or a figure could be included beside it. In this case, having the table _float_ right allows for text or images to be formatted on the left side of the page. For example, we could change the figure size as well and have a figure and table side-by-side in our document.


```{r  message = FALSE, cache = TRUE}

kable(dat, "html", escape = F)  %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE, position = "float_right") 
```


```{r echo = FALSE, fig.width=6, fig.height = 5, out.width='50%', out.height='50%'}
par(mar = c(4,4,1,0))
plot(cars$speed, cars$dist)
```


In this case, I shrank the plot using `out.width` and `out.height` so that it would fit beside our table.

`{r pressure,  echo = FALSE, fig.width=6, fig.height = 5, out.width='50%', out.height='50%'}`

```{r eval = FALSE}
par(mar = c(4,4,1,0)) #adjusting figure margins
plot(cars$speed, cars$dist)
```


</br>





</br>

#####Customize with highlighting and borders. 

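For this table, black lines have been specified as column borders. One row has been highlighted with a yellow background and its text emphasized in bold. By default, `kable` escapes special characters (`escape = TRUE`); setting `escape = FALSE` turns that off. Escaping does not matter for the plain column titles used here, but it would interfere with column titles that contain HTML, as in the footnote example below.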

```{r}
kable(dat, "html") %>%
       kable_styling("striped", full_width = FALSE) %>%
       column_spec(1:2, border_right = TRUE, border_left = TRUE) %>%
       row_spec(3, bold = T, color = "black", background = "yellow") 
```

#####Add footnotes.

Footnotes can be added to a table using symbols or alphabet markers for flags. 

This is a good time to learn two more useful data cleaning functions, `paste` and `paste0`. These are made to join or 'paste' character strings together. In this case we want to take a character string (the title of each column of our data frame) and add a footnote symbol to it to denote units. Can you tell what the difference is between the 2 functions by the output?
```{r message = FALSE}
colnames(dat)[1] <- paste("car_", colnames(dat)[1], footnote_marker_symbol(1))
colnames(dat)[2] <- paste0("car_", colnames(dat)[2], footnote_marker_alphabet(1))

```
`escape` has been set to `FALSE` so that the HTML encoding of our superscript is not escaped. The legend for the footnote symbol or letter below the table is added with `footnote()` at the end of our `kable` pipeline.
```{r}
kable(dat, "html", escape = FALSE) %>%
  kable_styling("striped", full_width = F) %>%
  column_spec(1:2, border_right = TRUE, border_left = TRUE) %>%
  row_spec(3, bold = T, color = "black", background = "yellow") %>%
  footnote(symbol = "kilometers per hour", alphabet = "kilometers;") 
```
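
Returning to `paste` and `paste0` from the column names above: the difference comes down to the separator. A minimal sketch with a made-up `col_name` variable:

```r
col_name <- "speed"

paste("car_", col_name)            # "car_ speed" (default separator is a space)
paste0("car_", col_name)           # "car_speed"  (no separator)
paste("car_", col_name, sep = "")  # "car_speed"  (same result as paste0)
paste("car", col_name, sep = "_")  # "car_speed"  (the separator supplies the underscore)

# collapse joins the elements of one vector into a single string
paste(c("speed", "dist"), collapse = " and ")  # "speed and dist"
```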


###Adding Images to your Document

To add pictures to your document:

   
    ! [#caption (optional)]   (#directory/file)     {#size (optional)}     
`![knitr - get it?](img/kitten-with-string.jpg){width=400px}`     


![knitr - get it?](img/kitten-with-string.jpg){width=400px}

Minimum syntax to add an image (no caption, default image size):     
`![](img/kitten-with-string.jpg)`
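
Since this lesson is about strings, here is a sketch of how the pieces of that syntax fit together, assembled with `paste0`. `md_image` is a hypothetical helper written for this illustration; it is not part of knitr or rmarkdown.

```r
# Hypothetical helper: build the markdown for an image from its parts.
md_image <- function(path, caption = "", width = NULL) {
  tag <- paste0("![", caption, "](", path, ")")  # caption in [], file in ()
  if (!is.null(width)) {
    tag <- paste0(tag, "{width=", width, "}")    # optional size in {}
  }
  tag
}

md_image("img/kitten-with-string.jpg")
md_image("img/kitten-with-string.jpg", caption = "knitr - get it?", width = "400px")
```

Printing the result with `cat()` inside a chunk that uses `results = 'asis'` should place the image in the knitted document.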

###Table of contents

This is the YAML header, including the table of contents (toc), for the lesson. It is as simple as writing `toc: TRUE` under the output for the document type you are using and then specifying with `toc_depth` what level of headers (remember our hashtags) you would like to include in the toc. I am keeping 1st, 2nd, and 3rd level headers in this example. If I had a 4th level header, it would still be larger than my text, but it would not show up in my table of contents. The toc creates a hyperlink to each section so the user can navigate the document. `CTRL+SHIFT+O` opens the document outline, which allows navigation to these sections while coding.  


    ---
    title: "Lesson 4 - Of Data Cleaning and Documentation - Conquer Regular Expressions, Use R markdown and knitr to make PDFs, and Challenge yourself with a 'Real' Dataset"
    output: 
      html_document:
              keep_md: yes
              toc: TRUE
              toc_depth: 3
      html_notebook:
              toc: TRUE
              toc_depth: 3
    ---
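
The same options can also be supplied from the R console instead of the YAML header. This is a sketch that assumes the `rmarkdown` package is installed; `render_with_toc` and the file name in the example call are made up for the illustration. Note that inside R the arguments are written `toc = TRUE`, whereas the YAML header uses `toc: TRUE`.

```r
# Hypothetical wrapper: render an .Rmd file to HTML with a table of contents.
render_with_toc <- function(input, depth = 3) {
  rmarkdown::render(
    input,
    output_format = rmarkdown::html_document(
      toc = TRUE,         # include a table of contents
      toc_depth = depth,  # deepest header level listed in the toc
      keep_md = TRUE      # keep the intermediate markdown file
    )
  )
}

# Example call; "Lesson_4.Rmd" is an assumed file name
# render_with_toc("Lesson_4.Rmd")
```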


You may have noticed the blue button that kind of looks like an eyeball in the top right corner of the Viewer Pane as well as the Source Pane with a dropdown that says 'Publish'. If you are super-proud of your work, you can post your rendered document for free, for the world to see at [Rpubs](https://rpubs.com/). It can be interesting to see what other people in the R community have been working on as well.

###Slides

Slideshows can also be made fairly simply in R markdown. Go to `File -> New File -> R Presentation` and create an .Rpres file. Slides are separated by a line of equals signs (===), and the title of each slide goes just above that line.


    First Slide
    ========================================================

    For more details on authoring R presentations please visit <https://support.rstudio.com/hc/en-us/articles/200486468>.

    - Bullet 1
    - Bullet 2
    - Bullet 3

    Slide With Code
    ========================================================

    ```{r}
    summary(cars)
    ```

    Slide With Plot
    ========================================================

    ```{r eval = FALSE}
    plot(cars)
    ```


If you click on 'Preview' in the Source Pane, a Presentation Tab will open in the Environment Pane with a slideshow that you can toggle through. In that Pane under 'More' you can also 'View in Browser' or 'Save As Webpage', which is the common way these slides get presented.

I really just wanted to show you that these slides exist. Depending on what you are presenting, this could be a quick alternative to PowerPoint if you need to present some code. Again, these are customizable <https://rmarkdown.rstudio.com/ioslides_presentation_format.html>.

If you are interested in a separate tutorial on making and customizing ioslides or the fancier [Slidify](https://www.jvcasillas.com/slidify_tutorial) slides, please leave a comment in the Lesson 4 survey (https://www.surveymonkey.com/r/PVHDKDB).


***

##A Real Messy Dataset

I looked for a messy dataset for data cleaning and found it in a blog titled:     
["Biologists: this is why bioinformaticians hate you..."](http://www.opiniomics.org/biologists-this-is-why-bioinformaticians-hate-you/) 
     
The main and common issue with this dataset is that when data entry was done there was no _structured vocabulary_; people could type whatever they wanted into free text answer boxes. The form could instead have used dropdown menus with limited options, given an error if something was formatted incorrectly, or stipulated some rules (i.e., must be all lowercase, uppercase, no numbers, spacing, etc.).

I must admit I have been guilty of messing with people who have made databases without rules. For example, giving an emergency contact, there was a line to input 'Relationship', which could easily have been a dropdown menu: 'parent, partner, friend, other'. Instead I was allowed to write in a free text box 'lifelong kindred spirit, soulmate and doggy-daddy'. I don't think anyone here was trying to be a nuisance; this messy data is just a consequence of poor data collection.

    


__Challenge:__      

This is the [Wellcome Trust APC dataset](https://github.com/eacton/CAGEF/blob/master/Lesson_4/data/University%20returns_for_figshare_FINAL.xlsx), which documents the costs of open access publishing by providing article processing charge (APC) data.

https://figshare.com/articles/Wellcome_Trust_APC_spend_2012_13_data_file/963054

<div style="float:right;margin:0 10px 10px 0" markdown="1">
![](img/yougotthis.jpg){width=200px}
</div>


What I want to know is: 

  1. List 3 problems with this dataset that require data cleaning.
  1. What is the mean cost of publishing for the top 3 most popular publishers? 
  1. What is the number of publications by PLOS One in the dataset?
  1. Convert sterling to CAD. What is the median cost of publishing with Elsevier in CAD?
  1. Annotate your data cleaning efforts and answers to these questions in an .Rmd file. Knit your final answers to pdf.

The route I suggest for answering these questions is:

* Inspect your dataset. Are the data types what you expect?
* Identify any immediate problems. (Answer Question #1)
* Clean up column names.
* Clean the publisher column.
    - convert all entries to lowercase
    - correct typos
    - correct multiple names for a publisher to one name
    - remove newline characters and trailing whitespace
* Answer Questions #2-5
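
To get you started, the publisher-cleaning steps above can be sketched on a toy example in base R. Everything in it is invented for illustration: the publisher spellings, the costs, and the decision to fold all PLOS variants into one name are assumptions, not values or rules taken from the Wellcome Trust file. The real column will have many more cases to handle.

```r
# Toy data, invented for illustration (not taken from the dataset)
publisher <- c("Elsevier", "ELSEVIER ", "Elsevier\n",
               "PLoS ONE", "plos  one", "Public Library of Science")
cost <- c(1000, 2000, 3000, 900, 1100, 1000)

clean <- tolower(publisher)            # convert all entries to lowercase
clean <- gsub("[\r\n]+", " ", clean)   # replace newline characters with a space
clean <- trimws(clean)                 # remove leading and trailing whitespace
clean <- gsub("\\s+", " ", clean)      # collapse repeated spaces into one

# Correct multiple names for a publisher to one name (assumed mapping)
clean[clean %in% c("plos one", "public library of science")] <- "plos"

# Count publications and compute the mean cost per cleaned publisher name
pub_counts <- sort(table(clean), decreasing = TRUE)
mean_cost <- tapply(cost, clean, mean)

pub_counts
mean_cost
```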



There is a [README](https://github.com/eacton/CAGEF/blob/master/Lesson_4/data/Readme_file.docx) file to go with this spreadsheet if you have questions about the data fields.  

</br>


The blogger's opinion of cleaning this dataset:

_'I now have no hair left; I’ve torn it all out.  My teeth are just stumps from excessive gnashing.  My faith in humanity has been destroyed!'_

Don't get to this point. The dataset doesn't need to be perfect. No datasets are 100% clean. Just do what you gotta do to answer these questions.  

We can talk about how this went at the beginning of next week's lesson.

***



   
__Resources:__     
<http://stat545.com/block022_regular-expression.html>     
<http://stat545.com/block027_regular-expressions.html>     
<http://stat545.com/block028_character-data.html>     
<http://r4ds.had.co.nz/strings.html>     
<http://www.gastonsanchez.com/Handling_and_Processing_Strings_in_R.pdf>     
<http://varianceexplained.org/r/trump-tweets/>     
<http://www.opiniomics.org/biologists-this-is-why-bioinformaticians-hate-you/>     
<https://figshare.com/articles/Wellcome_Trust_APC_spend_2012_13_data_file/963054>     
<http://www.datacommunitydc.org/blog/2013/08/fantastic-presentations-from-r-using-slidify-and-rcharts/>     
<https://github.com/rdpeng/cachesweave/blob/master/inst/doc/cacheSweave.Rnw>     
<http://emailregex.com/>     
<https://regex101.com/>     
<https://regexr.com/>     
<https://www.regular-expressions.info/backref.html>     
<https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf>     
<https://raw.githubusercontent.com/today-is-a-good-day/Emoticons/master/emDict.csv>     
<http://rmarkdown.rstudio.com>     
<https://yihui.name/knitr/options/#chunk_options>  
<https://www.cs.bham.ac.uk/~axj/pub/teaching/2016-7/stats/knitr-manual.pdf>     
<https://yihui.name/knitr/demo/cache/>  
<https://help.github.com/articles/organizing-information-with-tables/>      
<https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html#getting_started>     
<https://rmarkdown.rstudio.com/ioslides_presentation_format.html>     
<https://rpubs.com/>      
<https://www.jvcasillas.com/slidify_tutorial>     


#Post-Lesson Assessment
***

Your feedback is essential to help the next cohort of trainees. Please take a minute to complete the following short survey:
https://www.surveymonkey.com/r/PVHDKDB

</br>

***

</br>

Thanks for coming!!!

![](img/rstudio-bomb.png){width=300px}


